Abstract

Modeling community formation and detecting hidden communities in networks is a well-studied
problem. However, theoretical analysis of community detection has been mostly limited to models
with non-overlapping communities such as the stochastic block model. In this paper, we remove this
restriction, and consider a family of probabilistic network models with overlapping communities, termed
the mixed membership Dirichlet model, first introduced in [2]. This model allows for nodes to have
fractional memberships in multiple communities and assumes that the community memberships are
drawn from a Dirichlet distribution. We propose a unified approach to learning these models via a tensor
spectral decomposition method. Our estimator is based on a low-order moment tensor of the observed
network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebra
operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed
recovery of community memberships and model parameters and present a careful finite sample analysis
of our learning method. Additionally, our results match the best known scaling requirements in the
special case of the stochastic block model.

Keywords: Community detection, spectral methods, tensor methods, moment-based estimation, mixed
membership models.

1 Introduction

Studying communities forms an integral part of social network analysis. A community generally refers to
a group of individuals with shared interests (e.g. music, sports), or relationships (e.g. friends, co-workers).
Community formation in social networks has been studied by many sociologists, e.g. [33, 29, 31, 14], starting
with the seminal work of Moreno [33]. They posit various factors such as homophily^1 among the individuals
to be responsible for community formation. Various (probabilistic and non-probabilistic) network models
attempt to explain community formation. In addition, they also attempt to quantify interactions and the
extent of overlap between different communities, relative sizes among the communities, and various other
network properties. Studying such community models is also of interest in other domains, e.g. in biological
networks.

While there exists a vast literature on community models, learning these models is typically challenging, 
and various heuristics such as Markov Chain Monte Carlo (MCMC) or variational expectation maximization 
(EM) are employed in practice. Such heuristics tend to be unreliable and scale poorly for large networks. 
On the other hand, community models with guaranteed learning methods tend to be restrictive. A popular
class of probabilistic models, termed stochastic blockmodels, has been widely studied and enjoys strong
theoretical learning guarantees, e.g. [42, 23, 16, 41, 37, 32]. However, these models posit that an individual
belongs to a single community, which does not hold in most real settings [34].



^1 The term homophily refers to the tendency of individuals belonging to the same community to connect more than
individuals in different communities.

In this paper, we consider a class of mixed membership community models, originally introduced by
Airoldi et al. [2], and recently employed in [43, 21]. The model has been shown to be effective in many real-world
settings, but so far, no learning approach exists with provable guarantees, and in practice, learning is
carried out through Gibbs sampling or through variational Bayes. In this paper, we provide a novel
approach for learning mixed membership models with provable guarantees.

The mixed membership community model of [2] has a number of attractive properties. It retains many of 
the convenient properties of the stochastic block model. For instance, conditional independence of the edges 
is assumed, given the community memberships of the nodes in the network. At the same time, it allows 
for communities to overlap, and for every individual to be fractionally involved in different communities. 
It includes the stochastic block model as a special case (corresponding to zero overlap among the differ- 
ent communities). This enables us to compare our learning guarantees with existing works for stochastic 
block models and also study how the extent of overlap among different communities affects the learning 
performance. 

1.1 Summary of Results 

We now summarize the main contributions of this paper. We propose a novel approach for learning mixed 
membership community models of [2, 43, 21]. Our approach is a method of moments estimator and incor- 
porates tensor (spectral) decomposition. We provide guarantees for our approach under a set of sufficient 
conditions. Finally, we compare our results to existing ones for the special case of the stochastic block model, 
where nodes belong to a single community. 

We present a unified approach for the mixed membership model of [2]. The extent of overlap between
different communities in this model class is controlled (roughly) through a single scalar parameter, termed
the Dirichlet concentration parameter $\alpha_0 := \sum_i \alpha_i$, when the community vectors are drawn from the
Dirichlet distribution $\mathrm{Dir}(\alpha)$. When $\alpha_0 \to 0$, the mixed membership model degenerates to a stochastic block
model. We propose a unified learning method and provide recovery guarantees under a set of sufficient
conditions. We provide explicit scaling requirements in terms of the extent of community overlaps (through
$\alpha_0$), the network size $n$, the number of communities $k$, and the average edge connectivity across various
communities. For instance, for the special case where $p$ is the probability for any intra-community edge to
occur, $q$ corresponds to the inter-community connectivity, and the average community sizes are equal,
we require that^2

[display equation not recovered]    (1)

Thus, we require $n$ to be large enough compared to the number of communities $k$, and for the separation $p - q$
to be large enough, so that the learning method can distinguish the different communities. Moreover, we see
that the scaling requirements become more stringent as $\alpha_0$ increases. This is intuitive since it is harder to
learn communities with more overlap, and we quantify this scaling. Moreover, we quantify the error bounds
for estimating various parameters of the mixed membership model. Lastly, we establish zero-error guarantees
for support recovery: our learning method correctly identifies (w.h.p.) all the significant memberships of a
node and also identifies the set of communities where a node does not have a strong presence, and we quantify
these thresholds depending on $\alpha_0$.

For the special case of stochastic block models ($\alpha_0 \to 0$), (1) reduces to

[display equation not recovered]    (2)

The scaling requirements in (2) match the best known bounds (up to poly-log factors) and were previously
achieved by [44] via convex optimization involving semi-definite programming (SDP). In contrast, we
propose an iterative non-convex approach involving tensor power iterations and linear algebraic techniques,
and obtain similar guarantees. For a detailed comparison of learning guarantees under various methods for
learning stochastic block models, see [44].

^2 The notation $\tilde{\Omega}(\cdot), \tilde{O}(\cdot)$ denotes $\Omega(\cdot), O(\cdot)$ up to log factors.

Thus, we establish learning guarantees explicitly in terms of the extent of overlap among the differ- 
ent communities. Many real-world networks involve sparse community memberships and the total number 
of communities is typically much larger than the extent of membership of a single individual, e.g. hob- 
bies/interests of a person, university/company networks that a person belongs to, the set of transcription 
factors regulating a gene, and so on. Thus, we see that in this regime of practical interest, where $\alpha_0 = O(1)$,
the scaling requirements in (1) match those for the stochastic block model in (2) (up to polylog factors)
without any degradation in learning performance. Thus, we establish that learning community models with
sparse community memberships is akin to learning stochastic block models and we present a unified approach
and analysis for learning these models. To the best of our knowledge, this work is the first to establish
polynomial time learning guarantees for probabilistic network models with overlapping communities, and we provide
a fast, iterative learning approach through linear algebraic techniques and tensor power iterations.

1.2 Overview of Techniques 

We now describe the main techniques employed in our learning approach and in establishing the recovery 
guarantees. 

Method of moments and subgraph counts: We propose an efficient learning algorithm based on low
order moments, viz., counts of small subgraphs. Specifically, we employ a third-order tensor which counts
the number of 3-stars in the observed network. A 3-star is a star graph with three leaves (see Figure 1) and
we count the occurrences of such 3-stars across different partitions. We establish that (an adjusted) 3-star
count tensor has a simple relationship with the model parameters, when the network is drawn from a mixed
membership model. In particular, we propose a multi-linear transformation (also termed whitening)
under which the canonical polyadic (CP) decomposition of the tensor yields the model parameters and the
community vectors. Note that the decomposition of a general tensor into its rank-one components is referred
to as its CP decomposition [27] and is in general NP-hard [22]. However, we reduce the learning problem
to an orthogonal symmetric tensor decomposition, for which a tractable decomposition exists, as described
below.

Tensor spectral decomposition via power iterations: Our tensor decomposition method is based on
the popular power iterations (e.g. see [3]). It is a simple iterative method to compute the stable eigen-pairs
of a tensor. In this paper, we propose various modifications to the basic power method to strengthen the
recovery guarantees under perturbations. For instance, we introduce novel adaptive deflation techniques
(which involve subtracting out the eigen-pairs that were previously estimated). Moreover, we optimize
performance for the regime where the community overlaps are small. We initialize the tensor power method
with (whitened) neighborhood vectors from the observed network. In the regime where the community
overlaps are small, this leads to improved performance compared to random initialization. Additionally,
we incorporate thresholding as a post-processing operation, which again leads to improved guarantees for
sparse community memberships, i.e., when the overlap among different communities is small.
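
To make the basic update concrete, the following is a minimal sketch of the plain tensor power method with simple deflation on an exactly orthogonal symmetric tensor. It is not the modified procedure of this paper: it uses random initialization rather than whitened neighborhood vectors, ordinary rather than adaptive deflation, and no thresholding. All function names, the example components `V_EXAMPLE`, and the weights `LAMBDAS_EXAMPLE` are assumptions made for illustration.

```python
import math
import random


def rank_one(lam, v):
    """Symmetric rank-one tensor lam * (v (x) v (x) v) as nested lists."""
    d = len(v)
    return [[[lam * v[p] * v[q] * v[r] for r in range(d)]
             for q in range(d)] for p in range(d)]


def add_scaled(T, S, c):
    """Entrywise T + c * S for two d x d x d tensors."""
    d = len(T)
    return [[[T[p][q][r] + c * S[p][q][r] for r in range(d)]
             for q in range(d)] for p in range(d)]


def contract(T, u):
    """Vector T(I, u, u): entry p is sum over q, r of T[p][q][r] * u[q] * u[r]."""
    d = len(u)
    return [sum(T[p][q][r] * u[q] * u[r] for q in range(d) for r in range(d))
            for p in range(d)]


def normalize(x):
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]


def top_eigenpair(T, rng, n_restarts=10, n_iters=50):
    """Power update u <- T(I,u,u) / ||T(I,u,u)|| from several random starts;
    keep the start whose limit has the largest value T(u, u, u)."""
    d = len(T)
    best = None
    for _ in range(n_restarts):
        u = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
        for _ in range(n_iters):
            u = normalize(contract(T, u))
        lam = sum(a * b for a, b in zip(contract(T, u), u))  # T(u, u, u)
        if best is None or lam > best[0]:
            best = (lam, u)
    return best


def decompose(T, k, seed=0):
    """Extract k eigen-pairs, subtracting each one out before finding the next."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        lam, u = top_eigenpair(T, rng)
        pairs.append((lam, u))
        T = add_scaled(T, rank_one(lam, u), -1.0)  # deflation step
    return pairs


# Hypothetical example: three orthonormal components with distinct weights.
V_EXAMPLE = [[1 / 3, 2 / 3, 2 / 3], [2 / 3, 1 / 3, -2 / 3], [2 / 3, -2 / 3, 1 / 3]]
LAMBDAS_EXAMPLE = [3.0, 2.0, 1.0]
T_EXAMPLE = rank_one(0.0, V_EXAMPLE[0])
for lam_i, v_i in zip(LAMBDAS_EXAMPLE, V_EXAMPLE):
    T_EXAMPLE = add_scaled(T_EXAMPLE, rank_one(lam_i, v_i), 1.0)

if __name__ == "__main__":
    for lam, u in decompose(T_EXAMPLE, 3):
        print(round(lam, 6), [round(c, 6) for c in u])
```

On an exactly orthogonal tensor every limit point reached from a generic start is one of the true components, so the restarts only influence the order in which components are found. Under perturbations this is no longer automatic, which is where the modifications described above come in.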

Sample analysis: We establish that our learning approach correctly recovers the model parameters and 
the community memberships of all nodes under exact moments. We then carry out a careful analysis of 
the empirical graph moments, computed using the network observations. We establish tensor concentration 
bounds and also control the perturbation of the various quantities used by our learning algorithm via matrix 
Bernstein's inequality [40, thm. 1.4] and other inequalities. We impose the scaling requirements in (1) for 
various concentration bounds to hold. 

1.3 Related Work



There is extensive work on modeling communities and various algorithms and heuristics for discovering them. 
We mostly limit our focus to works with theoretical guarantees. 

Method of moments: The method of moments approach dates back to Pearson [35] and has been applied
for learning various community models. Here, the moments correspond to counts of various subgraphs in
the network. They typically consist of aggregate quantities, e.g., number of star subgraphs, triangles etc. in
the network. For instance, Bickel et al. [9] analyze the moments of a stochastic block model and establish
that the subgraph counts of certain structures, termed "wheels" (a family of trees), are sufficient for
identifiability under some natural non-degeneracy conditions. In contrast, we establish that moments up to
third order (corresponding to edge and 3-star counts) are sufficient for identifiability of the stochastic block
model, and also more generally, for the mixed membership Dirichlet model. We employ subgraph count
tensors, corresponding to the number of subgraphs (such as stars) over a set of labeled vertices, while the
work in [9] considers only aggregate (i.e. scalar) counts. Considering tensor moments allows us to use simple
subgraphs (edges and 3-stars) corresponding to low order moments, rather than more complicated graphs
(e.g. wheels considered in [9]) with a larger number of nodes, for learning the community model.

The method of moments is also relevant in the context of a family of random graph models termed as 
exponential random graph models [24, 17]. Subgraph counts of fixed graphs such as stars and triangles serve 
as sufficient statistics for these models. However, parameter estimation given the subgraph counts is in 
general NP-hard, due to the normalization constant in the likelihood (the partition function) and the model 
suffers from degeneracy issues; see [36, 13] for detailed discussion. In contrast, we establish in this paper that 
the mixed membership model is amenable to simple estimation methods through linear algebraic operations 
and tensor power iterations using subgraph counts of 3-stars. 

Stochastic block models: Many algorithms provide learning guarantees for stochastic block models.
For a detailed comparison of these methods, see the recent work in [44]. A popular method is based on
spectral clustering [32], where community memberships are inferred through projection onto the spectrum
of the Laplacian matrix (or its variants). This method is fast and easy to implement (via singular value
decomposition). On the other hand, its theoretical learning guarantees are weaker than those of the work
of [44], which uses convex optimization techniques via semi-definite programming.

Non-probabilistic approaches: The classical approach to community detection tries to directly exploit
the properties of the graph to define communities, without assuming a probabilistic model. Girvan and
Newman [20] use betweenness to remove edges until only communities are left. However, Bickel and Chen [8]
show that these algorithms are (asymptotically) biased and that using modularity scores can lead to the
discovery of an incorrect community structure, even for large graphs. Jalali et al. [25] define community
structure as the structure that satisfies the maximum number of edge constraints (whether two individuals
like/dislike each other). However, these models assume that every individual belongs to a single community.

Recently, some non-probabilistic approaches have been introduced with overlapping community models
by Arora et al. [6] and Balcan et al. [7]. The analysis of Arora et al. [6] is mostly limited to dense graphs (i.e.
$O(n^2)$ edges for an $n$-node graph), while our analysis provides learning guarantees for much sparser graphs (as
seen by the scaling requirements in (1)). Moreover, the running time of the method in [6] is quasipolynomial
(i.e. $O(n^{\log n})$) for the general case, and is based on a combinatorial learning approach. In contrast,
our learning approach is based on simple linear algebraic techniques and the running time is a low-order
polynomial (roughly $O(n^2 k)$ for an $n$-node network with $k$ communities). The work of [7] assumes
endogenously formed communities, by constraining the fraction of edges within a community compared to the
outside. They provide a polynomial time algorithm for finding all such "self-determined" communities and
the running time is $n^{O(\log(1/\alpha)/\alpha)}$, where $\alpha$ is the fraction of edges within a self-determined community, and
this bound is improved to linear time when $\alpha > 1/2$. On the other hand, the running time of our algorithm is
mostly independent of the parameters of the assumed model (and is roughly $O(n^2 k)$). Moreover, both these
works are limited to homophilic models, where there are more edges within each community than between
any two different communities. However, our learning approach is not limited to this setting and also does not
assume homogeneity in edge connectivity across different communities (while indeed it makes probabilistic
assumptions on community formation). In addition, we provide improved guarantees for homophilic models
by considering additional post-processing steps in our algorithm. Recently, Abraham et al. [1] provide an
algorithm for approximating the parameters of a Euclidean log-linear model in polynomial time. However,
their setting is considerably different from the one in this paper.

Inhomogeneous random graphs, graph limits and weak regularity lemma: Inhomogeneous
random graphs have been analyzed in a variety of settings (e.g., [11, 30]) and are generalizations of the
stochastic block model. Here, the probability of an edge between any two nodes is characterized by a general
function (rather than by a $k \times k$ matrix as in the stochastic block model with $k$ blocks). Note that the mixed
membership model considered in this work is a special instance of this general framework. These models
arise as the limits of convergent (dense) graph sequences and for this reason, the functions are also termed
"graphons" or graph limits [30]. A deep result in this context is the regularity lemma and its variants. The
weak regularity lemma proposed by Frieze and Kannan [18] shows that any convergent dense graph can be
approximated by a stochastic block model. Moreover, they propose an algorithm to learn such a block model
based on the so-called $d_2$ distance. The $d_2$ distance between two nodes measures similarity with respect to
their "two-hop" neighbors and the block model is obtained by thresholding the $d_2$ distances. However, the
method is limited to learning block models and not overlapping communities.

Learning Latent Variable Models (Topic Models): We leverage the recent developments from [5, 3, 
4] for learning topic models and other latent variable models based on the method of moments. Topic models 
have been popular in document modeling [10], and they posit that the words in a document are generated 
through multiple latent topics in a document. The works in [5, 3, 4] consider learning these models from 
second- and third-order observed moments through linear algebraic and tensor-based techniques. We exploit 
the tensor power iteration method of [4] and provide additional improvements to obtain stronger recovery 
guarantees. Moreover, the sample analysis is quite different in the community setting compared to other 
latent variable models analyzed in [5, 3, 4]. 

Relationship between community detection and tensor decomposition: There are works relating
the hardness of finding hidden cliques and the use of higher order moment tensors for this purpose. Frieze
and Kannan [19] relate the problem of finding a hidden clique to finding the top eigenvector of the third
order tensor, corresponding to the maximum spectral norm. However, this problem is known to be NP-hard
in general [22]. Brubaker and Vempala [12] extend the result to arbitrary $r$th-order tensors. The work in [15]
provides lower bounds on the complexity of statistical algorithms, i.e., those with access to moments (rather
than actual samples from the distribution), and shows that the cliques have to be of size $\Omega(n^{1/r})$ to enable
recovery from $r$th-order moment tensors in an $n$-node network. Thus tensors are useful for finding smaller
hidden cliques in a network (albeit by solving a computationally hard problem in general). In contrast,
we consider tractable tensor decomposition through reduction to orthogonal symmetric tensors (under the
scaling requirements of (1)), and our learning method is a fast, iterative approach based on tensor
power iterations and linear algebraic operations.

2 Community Models and Graph Moments 
2.1 Community Membership Models 

In this section, we describe the mixed membership community model based on Dirichlet priors for the 
community draws by the individuals. We first introduce the special case of the popular stochastic block 
model, where each node belongs to a single community. 

Notation: We consider stochastic network models with $n$ nodes and let $[n] := \{1, 2, \ldots, n\}$. Let $G$ be
the $\{0, 1\}$ adjacency^3 matrix for the random network and let $G_{A,B}$ be the submatrix of $G$ corresponding to
rows $A \subset [n]$ and columns $B \subset [n]$. We consider models with $k$ underlying (hidden) communities. For node
$i$, let $\pi_i \in \mathbb{R}^k$ denote its community membership vector, i.e., the vector is supported on the communities to
which the node belongs. In the special case of the popular stochastic block model described below, $\pi_i$ is a
basis coordinate vector, while the more general mixed membership model relaxes this assumption and a node
can be in multiple communities with fractional memberships. Define $\Pi := [\pi_1 | \pi_2 | \cdots | \pi_n] \in \mathbb{R}^{k \times n}$, and let
$\Pi_A := [\pi_i : i \in A] \in \mathbb{R}^{k \times |A|}$ denote the set of column vectors restricted to $A \subset [n]$. For a matrix $M$, let $(M)_i$
and $(M)^i$ denote its $i$th column and row respectively. For a matrix $M$ with singular value decomposition
(SVD) $M = U D V^\top$, let $(M)_{k\text{-svd}} := U \tilde{D} V^\top$ denote the $k$-rank SVD of $M$, where $\tilde{D}$ is limited to the top-$k$
singular values of $M$. Let $M^\dagger$ denote the Moore-Penrose pseudo-inverse of $M$. Let $\mathbb{I}(\cdot)$ be the indicator
function. We use the term high probability to mean with probability $1 - n^{-c}$ for any constant $c > 0$.

Stochastic block model (special case): In this model, each individual is independently assigned to
a single community, chosen at random: each node $i$ chooses community $j$ independently with probability
$\widehat{\alpha}_j$, for $i \in [n], j \in [k]$, and we assign $\pi_i = e_j$ in this case, where $e_j \in \{0, 1\}^k$ is the $j$th coordinate basis
vector. Given the community assignments $\Pi$, every directed^4 edge in the network is independently drawn:
if node $u$ is in community $i$ and node $v$ is in community $j$ (and $u \neq v$), then the probability of having
the edge $(u, v)$ in the network is $P_{i,j}$. Here, $P \in [0, 1]^{k \times k}$ and we refer to it as the community connectivity
matrix. This implies that given the community membership vectors $\pi_u$ and $\pi_v$, the probability of an edge
from $u$ to $v$ is $\pi_u^\top P \pi_v$ (since when $\pi_u = e_i$ and $\pi_v = e_j$, we have $\pi_u^\top P \pi_v = P_{i,j}$). The stochastic block model has
been extensively studied and can be learnt efficiently through various methods, e.g. spectral clustering [32],
convex optimization [44], and so on. Many of these methods rely on conditional independence assumptions
of the edges in the block model for guaranteed learning.

Mixed membership model: We now consider the extension of the stochastic block model which allows
for an individual to belong to multiple communities and yet preserves some of the convenient independence
assumptions of the block model. In this model, the community membership vector $\pi_u$ at node $u$ is a
probability vector, i.e., $\sum_{i \in [k]} \pi_u(i) = 1$, for all $u \in [n]$. Given the community membership vectors, the
generation of the edges is identical to the block model: given vectors $\pi_u$ and $\pi_v$, the probability of an
edge from $u$ to $v$ is $\pi_u^\top P \pi_v$, and the edges are independently drawn. Under this formulation, given the
community vectors $\Pi$, a random network can be generated as follows: for each node pair $u, v$, a community
pair $i, j$ is drawn independently from their community membership vectors, i.e. $i \sim \pi_u$ and $j \sim \pi_v$, and
the edge $(u, v)$ is drawn independently with probability $P_{i,j}$. This formulation allows for the nodes to be in
multiple communities, and at the same time, preserves the conditional independence of the edges, given the
community memberships of the nodes.
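
The two-stage generation just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, membership vectors are stored as a list with one row per node (the transpose of the $k \times n$ matrix $\Pi$), and the numerical values in the usage example are made up.

```python
import random


def edge_probability(pi_u, pi_v, P):
    """Closed form pi_u^T P pi_v for the probability of the edge (u, v)."""
    k = len(pi_u)
    return sum(pi_u[i] * P[i][j] * pi_v[j] for i in range(k) for j in range(k))


def sample_edge(pi_u, pi_v, P, rng):
    """Two-stage draw: i ~ pi_u, j ~ pi_v, then an edge with probability P[i][j]."""
    k = len(pi_u)
    i = rng.choices(range(k), weights=pi_u)[0]
    j = rng.choices(range(k), weights=pi_v)[0]
    return 1 if rng.random() < P[i][j] else 0


def sample_network(memberships, P, seed=0):
    """Directed {0,1} adjacency matrix; memberships[u] is the vector pi_u.
    Self-loops are excluded, matching the condition u != v."""
    rng = random.Random(seed)
    n = len(memberships)
    return [[0 if u == v else sample_edge(memberships[u], memberships[v], P, rng)
             for v in range(n)] for u in range(n)]


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    P = [[0.9, 0.1], [0.1, 0.8]]
    memberships = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]
    print(edge_probability(memberships[1], memberships[2], P))
    print(sample_network(memberships, P))
```

Averaging many calls to `sample_edge` for a fixed pair approaches `edge_probability`, which is the sense in which the two-stage draw and the bilinear form $\pi_u^\top P \pi_v$ describe the same model.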

Dirichlet prior for community membership: The only aspect left to be specified for the mixed
membership model is the distribution from which the community membership vectors $\Pi$ are drawn. We
consider the popular setting of [2, 21], where the community vectors $\{\pi_u\}$ are i.i.d. draws from the Dirichlet
distribution, denoted by $\mathrm{Dir}(\alpha)$, with parameter vector $\alpha \in \mathbb{R}^k_{>0}$. The probability density function of the
Dirichlet distribution is given by

$$\mathbb{P}[\pi] = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \pi_i^{\alpha_i - 1}, \qquad \pi \sim \mathrm{Dir}(\alpha), \quad \alpha_0 := \sum_i \alpha_i, \qquad (3)$$

where $\Gamma(\cdot)$ is the Gamma function and the ratio of the Gamma functions serves as the normalization constant.

^3 Our analysis can easily be extended to weighted adjacency matrices with bounded entries.

^4 We limit our discussion to directed networks in this paper, but note that the results also hold for undirected community
models, where $P$ is a symmetric matrix, and an edge $(u, v)$ is formed with probability $\pi_u^\top P \pi_v = \pi_v^\top P \pi_u$.

Figure 1: Our moment-based learning algorithm uses the 3-star count tensor from partition $X$ to partitions
$A, B, C$ (and the roles of the partitions are interchanged to get various estimates). Specifically, $T$ is a
third-order tensor, where $T(u, v, w)$ is the normalized count of the 3-stars with $u, v, w$ as leaves over all $x \in X$.

The Dirichlet distribution is widely employed for specifying priors in Bayesian statistics, e.g. latent 
Dirichlet allocation [10]. The Dirichlet distribution is the conjugate prior of the multinomial distribution 
which makes it attractive for Bayesian inference. 

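Let $\widehat{\alpha}$ denote the normalized parameter vector $\alpha / \alpha_0$, where $\alpha_0 := \sum_i \alpha_i$. In particular, note that $\widehat{\alpha}$ is a
probability vector: $\sum_i \widehat{\alpha}_i = 1$. Intuitively, $\widehat{\alpha}$ denotes the relative expected sizes of the communities (since
$\mathbb{E}[n^{-1} \sum_{u \in [n]} \pi_u(i)] = \widehat{\alpha}_i$). Let $\widehat{\alpha}_{\max}$ be the largest entry in $\widehat{\alpha}$, and $\widehat{\alpha}_{\min}$ be the smallest entry. Our learning
guarantees will depend on these parameters.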

The stochastic block model is a limiting case of the mixed membership model when the Dirichlet
parameter is $\alpha = \alpha_0 \cdot \widehat{\alpha}$, where the probability vector $\widehat{\alpha}$ is held fixed and $\alpha_0 \to 0$. In the other extreme when
$\alpha_0 \to \infty$, the Dirichlet distribution becomes peaked around a single point, for instance if $\alpha_i \equiv c$ and $c \to \infty$,
the Dirichlet distribution is peaked at $\frac{1}{k} \cdot \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector. Thus, the parameter $\alpha_0$ serves
as a measure of the average sparsity of the Dirichlet draws or equivalently, of how concentrated the Dirichlet
measure is along the different coordinates. This, in effect, controls the extent of overlap among different
communities.

Sparse regime of Dirichlet distribution: When the Dirichlet parameter vector satisfies^5 $\alpha_i < 1$,
for all $i \in [k]$, the Dirichlet distribution $\mathrm{Dir}(\alpha)$ generates "sparse" vectors with high probability^6; see [39]
(and in the extreme case of the block model where $\alpha_0 \to 0$, it generates 1-sparse vectors). Many real-world
settings involve sparse community membership and the total number of communities is typically much larger
than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company
networks that a person belongs to, the set of transcription factors regulating a gene, and so on. Our learning
guarantees are limited to the sparse regime of the Dirichlet model.
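
The effect of the concentration parameter on sparsity can be observed empirically. The sketch below draws Dirichlet vectors by normalizing independent Gamma variates (a standard construction) and counts how many coordinates exceed a threshold. The function names, the threshold `tau = 0.1`, and the two concentration values in the usage example are assumptions for illustration; the sketch does not attempt to check the high-probability bound itself.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw pi ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variates."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        if total > 0.0:  # guard against all variates underflowing to zero
            return [x / total for x in g]


def mean_support_size(alpha_hat, alpha0, tau=0.1, n_draws=2000, seed=0):
    """Average number of coordinates of pi exceeding tau, where
    pi ~ Dir(alpha0 * alpha_hat) and alpha_hat is a probability vector."""
    rng = random.Random(seed)
    alpha = [alpha0 * a for a in alpha_hat]
    count = 0
    for _ in range(n_draws):
        pi = sample_dirichlet(alpha, rng)
        count += sum(1 for x in pi if x > tau)
    return count / n_draws


if __name__ == "__main__":
    alpha_hat = [0.2] * 5  # five communities of equal expected size
    print("alpha0 = 0.5:", mean_support_size(alpha_hat, 0.5))
    print("alpha0 = 50 :", mean_support_size(alpha_hat, 50.0))
```

With a small concentration parameter most draws place nearly all mass on one or two communities, while a large one spreads mass across all five, which is the overlap behaviour described above.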

2.2 Graph Moments Under Mixed Membership Models 

Our approach for learning a mixed membership community model relies on the form of the graph moments^7
under the mixed membership model. We now describe the specific graph moments used by our learning 
algorithm (based on 3-star and edge counts) and provide explicit forms for the moments, assuming draws 
from a mixed membership model. 

Notations 

Recall that $G$ denotes the adjacency matrix and that $G_{X,A}$ denotes the submatrix corresponding to edges
going from $X$ to $A$. Recall that $P \in [0, 1]^{k \times k}$ denotes the community connectivity matrix. Define

$$F := \Pi^\top P^\top = [\pi_1 | \pi_2 | \cdots | \pi_n]^\top P^\top. \qquad (4)$$

^5 The assumption that the Dirichlet distribution be in the sparse regime is not strictly needed. Our results can be extended
to general Dirichlet distributions, but with worse scaling requirements on the network size $n$ for guaranteed learning.

^6 Roughly, the number of entries in $\pi$ exceeding a threshold $\tau$ is at most $O(\alpha_0 \log(1/\tau))$ with high probability, when $\pi \sim \mathrm{Dir}(\alpha)$.

^7 We interchangeably use the term first order moments for edge counts and third order moments for 3-star counts.

For a subset $A \subset [n]$ of individuals, let $F_A \in \mathbb{R}^{|A| \times k}$ denote the submatrix of $F$ corresponding to nodes in
$A$, i.e., $F_A := \Pi_A^\top P^\top$. Let $\mathrm{Diag}(v)$ denote a diagonal matrix with diagonal entries given by a vector $v$.

Our learning algorithm uses moments up to the third order, represented as a tensor. A third-order tensor
$T$ is a three-dimensional array whose $(p, q, r)$-th entry is denoted by $T_{p,q,r}$. The symbol $\otimes$ denotes the standard
Kronecker product: if $u, v, w$ are three vectors, then

$$(u \otimes v \otimes w)_{p,q,r} := u_p \cdot v_q \cdot w_r. \qquad (5)$$

A tensor of the form $u \otimes v \otimes w$ is referred to as a rank-one tensor. The decomposition of a general tensor
into a sum of its rank-one components is referred to as canonical polyadic (CP) decomposition [27]. We will
subsequently see that the graph moments can be expressed as a tensor and that the CP decomposition of the
graph-moment tensor yields the model parameters and the community vectors under the mixed membership
community model.
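
As a small worked instance of definition (5), using vectors that are hypothetical and chosen only for illustration:

```latex
u = (1, 2), \quad v = (3, 4), \quad w = (5, 6)
\;\Longrightarrow\;
(u \otimes v \otimes w)_{2,1,2} = u_2 \cdot v_1 \cdot w_2 = 2 \cdot 3 \cdot 6 = 36 .
```

The resulting $2 \times 2 \times 2$ array has eight entries, all determined by the six numbers in $u, v, w$; this is the low-dimensional structure that a CP decomposition recovers.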

2.2.1 Graph moments under Stochastic Block Model 

We first analyze the graph moments in the special case of a stochastic block model (i.e., $\alpha_0 = \sum_i \alpha_i \to 0$
in the Dirichlet prior in (3)) and then extend it to the general mixed membership model. We provide explicit
expressions for the graph moments corresponding to edge counts and 3-star counts. We later establish in 
Section 3 that these moments are sufficient to learn the community memberships of the nodes and the model 
parameters of the block model. 

3-star counts: The primary quantity of interest is a third-order tensor which counts the number of
3-stars. A 3-star is a star graph with three leaves $\{a, b, c\}$ and we refer to the internal node $x$ of the star as its
"head", and denote the structure by $x \to \{a, b, c\}$ (see Figure 1). We partition the network into four^8 parts
and consider 3-stars such that each node in the 3-star belongs to a different partition. This is necessary to
obtain a simple form of the moments, based on the conditional independence assumptions of the block model,
see Proposition 2.1. Specifically, consider a partition $A, B, C, X$ of the network. We count the number of
3-stars from $X$ to $A, B, C$ and our quantity of interest is

$$T_{X \to \{A,B,C\}} := \frac{1}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \otimes G_{i,B}^\top \otimes G_{i,C}^\top \right], \qquad (6)$$

where $\otimes$ is the Kronecker product, defined in (5), and $G_{i,A}$ is the row vector supported on the set of neighbors
of $i$ belonging to set $A$. $T \in \mathbb{R}^{|A| \times |B| \times |C|}$ is a third-order tensor, and an element of the tensor is given by

$$T_{X \to \{A,B,C\}}(a, b, c) = \frac{1}{|X|} \sum_{x \in X} G(x, a) \, G(x, b) \, G(x, c), \qquad \forall a \in A, \ b \in B, \ c \in C, \qquad (7)$$

which is the normalized count of the number of 3-stars with leaves $a, b, c$ such that its "head" is in set $X$.
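
The entrywise definition (7) translates directly into code. The sketch below is a naive implementation for small graphs, with a hypothetical function name and a made-up five-node example; entries of the result are indexed by positions within the lists `A`, `B`, `C` rather than by node labels.

```python
def three_star_tensor(G, X, A, B, C):
    """Normalized 3-star counts:
    T[a][b][c] = (1/|X|) * sum over x in X of G[x][a] * G[x][b] * G[x][c],
    where G[x][a] = 1 if there is an edge from x to a and 0 otherwise."""
    return [[[sum(G[x][a] * G[x][b] * G[x][c] for x in X) / len(X)
              for c in C] for b in B] for a in A]


if __name__ == "__main__":
    # Hypothetical directed graph on nodes 0..4 (row = source, column = target).
    G = [
        [0, 0, 1, 1, 1],  # node 0 points to 2, 3 and 4: a full 3-star
        [0, 0, 1, 0, 1],  # node 1 misses the edge to 3: no 3-star
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    T = three_star_tensor(G, X=[0, 1], A=[2], B=[3], C=[4])
    print(T)  # one of the two heads forms a 3-star on leaves (2, 3, 4)
```

This direct loop costs on the order of $|X| \cdot |A| \cdot |B| \cdot |C|$ operations and is meant only to mirror (7), not to be an efficient way of forming the tensor.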

We now relate the tensor $T_{X \to \{A,B,C\}}$ to the parameters of the stochastic block model, viz., the community
connectivity matrix $P$ and the community probability vector $\widehat{\alpha}$, where $\widehat{\alpha}_i$ is the probability of choosing
community $i$.

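Proposition 2.1 (Moments in Stochastic Block Model). Given partitions $A, B, C, X$, and $F := \Pi^\top P^\top$,
where $P$ is the community connectivity matrix and $\Pi$ is the matrix of community membership vectors, we
have

$$\mathbb{E}[G_{X,A}^\top \mid \Pi_A, \Pi_X] = F_A \Pi_X, \qquad (8)$$
$$\mathbb{E}[T_{X \to \{A,B,C\}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i \in [k]} \widehat{\alpha}_i \, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i, \qquad (9)$$

where $\widehat{\alpha}_i$ is the probability for a node to select community $i$.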

*For sample complexity analysis, we require dividing the graph into more than four partitions to deal with statistical 
dependency issues, and we outline it in Section 3. 

Remark 1: Note the form of the 3-star count tensor $T$ in (9). It provides a CP decomposition of $T$ since
each term in the summation, viz., $\hat{\alpha}_i (F_A)_i \otimes (F_B)_i \otimes (F_C)_i$, is a rank one tensor. Thus, we can learn the
matrices $F_A, F_B, F_C$ and the vector $\hat{\alpha}$ through CP decomposition of tensor $T$. We can then exploit the form
of the adjacency submatrix $G_{X,A}$ in (8) to obtain $\Pi_X$, the set of community vectors of nodes in $X$. Similarly,
we can consider another tensor consisting of 3-stars from $A$ to $X, B, C$, and obtain matrices $F_X, F_B$ and $F_C$
through a CP decomposition, and so on. Once we obtain matrices $F$ and $\Pi$ for the entire set of nodes in
this manner, we can obtain the community connectivity matrix $P$, since $F := \Pi^\top P^\top$. Thus, in principle,
we are able to learn all the model parameters ($\hat{\alpha}$ and $P$) and the community membership matrix $\Pi$ under
the stochastic block model. This forms our basic approach for learning the community model using the
adjacency matrix $G$ and the 3-star count tensor $T$. The details are in Section 3.

Remark 2: The main property exploited in proving the tensor form in (9) is the conditional-independence
assumption under the stochastic block model: the realization of the edges in each 3-star, say in $x \to \{a, b, c\}$,
is conditionally independent given the community membership vector $\pi_x$, when $x \neq a \neq b \neq c$. This is
because the community membership vectors $\Pi$ are assumed to be drawn independently at the different nodes
and the edges are drawn independently given the community vectors. Considering 3-stars from $X$ to $A, B, C$
where $X, A, B, C$ form a partition ensures that this conditional independence is satisfied for all the 3-stars
in tensor $T$.

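Proof: Recall that the probability of an edge from $u$ to $v$ given $\pi_u, \pi_v$ is

$$\mathbb{E}[G_{u,v} \mid \pi_u, \pi_v] = \pi_u^\top P \pi_v = \pi_v^\top P^\top \pi_u = F_v \pi_u,$$

and $\mathbb{E}[G_{X,A} \mid \Pi_A, \Pi_X] = \Pi_X^\top P \Pi_A = \Pi_X^\top F_A^\top$ and thus (8) holds. For the tensor form, first consider an element
of the tensor, with $a \in A, b \in B, c \in C$,

$$\mathbb{E}\left[T_{X \to \{A,B,C\}}(a,b,c) \mid \pi_a, \pi_b, \pi_c, \Pi_X\right] \overset{(a)}{=} \frac{1}{|X|} \sum_{x \in X} F_a \pi_x \cdot F_b \pi_x \cdot F_c \pi_x,$$

where (a) follows from the conditional-independence assumption of the edges (assuming $a \neq b \neq c$). Now
taking expectation over the nodes in $X$, we have

$$\begin{aligned}
\mathbb{E}\left[T_{X \to \{A,B,C\}}(a,b,c) \mid \pi_a, \pi_b, \pi_c\right] &= \frac{1}{|X|} \sum_{x \in X} \mathbb{E}\left[F_a \pi_x \cdot F_b \pi_x \cdot F_c \pi_x \mid \pi_a, \pi_b, \pi_c\right] \\
&= \mathbb{E}\left[F_a \pi \cdot F_b \pi \cdot F_c \pi \mid \pi_a, \pi_b, \pi_c\right] \\
&= \sum_{j \in [k]} \hat{\alpha}_j\, (F_a)_j \cdot (F_b)_j \cdot (F_c)_j,
\end{aligned}$$

where the last step follows from the fact that $\pi = e_j$ with probability $\hat{\alpha}_j$ and the result holds when $x \neq a, b, c$.
Recall that $(F_A)_j$ denotes the $j$-th column of $F_A$ (since $F_A e_j = (F_A)_j$). Collecting all the elements of the tensor,
we obtain the desired result. □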

2.2.2 Graph Moments under Mixed Membership Dirichlet Model 

We now analyze the graph moments for the general mixed membership Dirichlet model. Instead of the raw 
moments (i.e. edge and 3-star counts), we consider modified moments to obtain similar expressions as in the 
case of the stochastic block model. 

Let $\mu_{X \to A} \in \mathbb{R}^{|A|}$ denote a vector which gives the normalized count of edges from $X$ to $A$:

$$\mu_{X \to A} := \frac{1}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \right]. \tag{10}$$
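
We now define a modified adjacency matrix^ $G^{\alpha_0}_{X,A}$ as

$$G^{\alpha_0}_{X,A} := \left( \sqrt{\alpha_0 + 1}\, G_{X,A} - \left(\sqrt{\alpha_0 + 1} - 1\right) \vec{1} \mu_{X \to A}^\top \right). \tag{11}$$

In the special case of the stochastic block model ($\alpha_0 \to 0$), $G^{\alpha_0}_{X,A} = G_{X,A}$ is the submatrix of the adjacency
matrix $G$. Similarly, we define modified third-order statistics,

$$\begin{aligned}
T^{\alpha_0}_{X \to \{A,B,C\}} := {} & (\alpha_0 + 1)(\alpha_0 + 2)\, T_{X \to \{A,B,C\}} + 2 \alpha_0^2\, \mu_{X \to A} \otimes \mu_{X \to B} \otimes \mu_{X \to C} \\
& - \frac{\alpha_0 (\alpha_0 + 1)}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \otimes G_{i,B}^\top \otimes \mu_{X \to C} + G_{i,A}^\top \otimes \mu_{X \to B} \otimes G_{i,C}^\top + \mu_{X \to A} \otimes G_{i,B}^\top \otimes G_{i,C}^\top \right],
\end{aligned} \tag{12}$$

and it reduces to the 3-star count $T_{X \to \{A,B,C\}}$ defined in (6) for the stochastic block model ($\alpha_0 \to 0$). The
modified adjacency matrix and the 3-star count tensor can be viewed as a form of "centering" of the raw
moments which simplifies the expressions for the moments. The following relationships hold between the
modified graph moments $G^{\alpha_0}$, $T^{\alpha_0}$ and the model parameters $P$ and $\hat{\alpha}$ of the mixed membership model.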

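Proposition 2.2 (Moments in Mixed Membership Model). Given partitions $A, B, C, X$ and $G^{\alpha_0}_{X,A}$ and $T^{\alpha_0}$,
as in (11) and (12), normalized Dirichlet vector $\hat{\alpha}$, and $F := \Pi^\top P^\top$, where $P$ is the community connectivity
matrix and $\Pi$ is the matrix of community membership vectors, we have

$$\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi_A, \Pi_X] = F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, \Psi_X, \tag{13}$$

$$\mathbb{E}[T^{\alpha_0}_{X \to \{A,B,C\}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i=1}^{k} \hat{\alpha}_i\, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i, \tag{14}$$

where $(F_A)_i$ corresponds to the $i$-th column of $F_A$ and $\Psi_X$ relates to the community membership matrix $\Pi_X$ as

$$\Psi_X := \mathrm{Diag}(\hat{\alpha}^{-1/2}) \left( \sqrt{\alpha_0 + 1}\, \Pi_X - \left(\sqrt{\alpha_0 + 1} - 1\right) \left( \frac{1}{|X|} \sum_{i \in X} \pi_i \right) \vec{1}^\top \right).$$

Moreover, we have that

$$|X|^{-1}\, \mathbb{E}_{\Pi_X}[\Psi_X \Psi_X^\top] = I. \tag{15}$$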

Remark: The 3-star count tensor $T^{\alpha_0}$ is carefully chosen so that the CP decomposition of the tensor
directly yields the matrices $F_A, F_B, F_C$ and $\hat{\alpha}_i$, as in the case of the stochastic block model. Similarly, the
modified adjacency matrix $(G^{\alpha_0}_{X,A})^\top$ is carefully chosen to eliminate second-order correlation in the Dirichlet
distribution and we have that $|X|^{-1}\, \mathbb{E}_{\Pi_X}[\Psi_X \Psi_X^\top] = I$ is the identity matrix. These properties will be exploited
by our learning algorithm in Section 3.
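
The centering in (10) and (11) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `modified_adjacency` and the toy block `G_XA` are hypothetical, and $\alpha_0$ is taken as known, as the text assumes.

```python
import numpy as np

def modified_adjacency(G_XA, alpha0):
    """Modified adjacency matrix of equation (11) from the |X| x |A| block G_{X,A}."""
    G_XA = np.asarray(G_XA, dtype=float)
    mu = G_XA.mean(axis=0)                 # mu_{X->A} of equation (10), length |A|
    s = np.sqrt(alpha0 + 1.0)
    ones = np.ones((G_XA.shape[0], 1))     # all-ones column vector of length |X|
    return s * G_XA - (s - 1.0) * ones @ mu[None, :]

# Hypothetical 4 x 3 block of a 0/1 adjacency matrix (rows: nodes in X, columns: nodes in A).
G_XA = np.array([[1, 0, 1],
                 [0, 0, 1],
                 [1, 1, 0],
                 [0, 0, 0]])

G_sbm = modified_adjacency(G_XA, alpha0=0.0)  # stochastic block model limit
G_mm = modified_adjacency(G_XA, alpha0=3.0)   # sqrt(alpha0 + 1) = 2
```

With `alpha0 = 0` the block is returned unchanged, matching the special case noted after (11). For any `alpha0` the column means are preserved, since $\sqrt{\alpha_0+1}\,\mu - (\sqrt{\alpha_0+1}-1)\,\mu = \mu$.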

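Proof: The proof is on lines of Proposition 2.1 for stochastic block models ($\alpha_0 \to 0$) but more involved
due to the form of Dirichlet moments. Recall $\mathbb{E}[G_{i,A}^\top \mid \pi_i, \Pi_A] = F_A \pi_i$ for a mixed membership model,
and $\mu_{X \to A} := \frac{1}{|X|} \sum_{i \in X} G_{i,A}^\top$, therefore $\mathbb{E}[\mu_{X \to A} \mid \Pi_A, \Pi_X] = F_A \left( \frac{1}{|X|} \sum_{i \in X} \pi_i \right)$. Equation (13) follows
directly. For Equation (15), we note the Dirichlet moment, $\mathbb{E}[\pi \pi^\top] = \frac{1}{\alpha_0 + 1} \mathrm{Diag}(\hat{\alpha}) + \frac{\alpha_0}{\alpha_0 + 1} \hat{\alpha} \hat{\alpha}^\top$, when
$\pi \sim \mathrm{Dir}(\alpha)$ and

$$\begin{aligned}
|X|^{-1}\, \mathbb{E}[\Psi_X \Psi_X^\top] &= \mathrm{Diag}(\hat{\alpha}^{-1/2}) \Big[ (\alpha_0 + 1)\, \mathbb{E}[\pi \pi^\top] + \big( -2 \sqrt{\alpha_0 + 1} (\sqrt{\alpha_0 + 1} - 1) \\
&\qquad + (\sqrt{\alpha_0 + 1} - 1)^2 \big)\, \mathbb{E}[\pi]\, \mathbb{E}[\pi]^\top \Big] \mathrm{Diag}(\hat{\alpha}^{-1/2}) \\
&= \mathrm{Diag}(\hat{\alpha}^{-1/2}) \left( \mathrm{Diag}(\hat{\alpha}) + \alpha_0 \hat{\alpha} \hat{\alpha}^\top + (-\alpha_0) \hat{\alpha} \hat{\alpha}^\top \right) \mathrm{Diag}(\hat{\alpha}^{-1/2}) \\
&= I.
\end{aligned}$$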




^To compute the modified moments $G^{\alpha_0}$ and $T^{\alpha_0}$, we need to know the value of the scalar $\alpha_0 := \sum_i \alpha_i$, which is the
concentration parameter of the Dirichlet distribution and is a measure of the extent of overlap between the communities. We
assume its knowledge here.

On lines of the proof of Proposition 2.1 for the block model, the expectation in (14) involves a multi-linear
map of the expectation of the tensor products $\pi \otimes \pi \otimes \pi$ among other terms. Collecting these terms, we have
that the resulting combination of moments of $\pi$ (the population counterpart of (12), with $\pi$ and $\mathbb{E}[\pi]$ in place
of the neighborhood vectors and their averages) is a diagonal tensor, in the sense that its $(p,p,p)$-th entry is
$\hat{\alpha}_p$, and its $(p,q,r)$-th entry is 0 when $p, q, r$ are not all equal. With this, we have (14). □
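
The second-moment identity (15) used in the proof above can be checked numerically in closed form. The sketch below plugs the Dirichlet moments quoted in the proof into the bracketed expression; the function name `psi_second_moment` and the test vector are assumptions for illustration only.

```python
import numpy as np

def psi_second_moment(alpha):
    """Evaluate |X|^{-1} E[Psi_X Psi_X^T] from the closed-form Dirichlet moments."""
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    a_hat = alpha / alpha0                          # normalized Dirichlet vector
    s = np.sqrt(alpha0 + 1.0)
    # Dirichlet second moment E[pi pi^T]
    second = (np.diag(a_hat) + alpha0 * np.outer(a_hat, a_hat)) / (alpha0 + 1.0)
    mean_outer = np.outer(a_hat, a_hat)             # E[pi] E[pi]^T
    coeff = -2.0 * s * (s - 1.0) + (s - 1.0) ** 2   # simplifies to -alpha0
    inner = (alpha0 + 1.0) * second + coeff * mean_outer
    scale = np.diag(a_hat ** -0.5)
    return scale @ inner @ scale

result = psi_second_moment([0.2, 0.5, 0.3, 1.0])    # arbitrary Dirichlet parameters
```

The coefficient of $\mathbb{E}[\pi]\mathbb{E}[\pi]^\top$ collapses to $-\alpha_0$ because $-2s(s-1) + (s-1)^2 = -(s^2 - 1)$ with $s = \sqrt{\alpha_0 + 1}$, which is why the off-diagonal terms cancel and `result` is the identity.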

Note the nearly identical forms of the graph moments for the stochastic block model in (8), (9) and for
the general mixed membership model in (13), (14). In other words, the modified moments $G^{\alpha_0}_{X,A}$ and $T^{\alpha_0}$
have similar relationships to underlying parameters as the raw moments in the case of the stochastic block
model. This enables us to use a unified learning approach for the two models, outlined in the next section.

3 Algorithm for Learning Mixed Membership Models 

The simple form of the graph moments derived in the previous section is now utilized to recover the
community vectors $\Pi$ and model parameters $P, \hat{\alpha}$ of the mixed membership model. The method is based on
the so-called tensor power method, used to obtain a tensor decomposition. We first outline the basic
tensor decomposition method below and then demonstrate how the method can be adapted to learning using
the graph moments at hand. We first analyze the simpler case when exact moments are available in
Section 3.2 and then extend the method to handle empirical moments computed from the network observations
in Section 3.3.

3.1 Overview of Tensor Decomposition Through Power Iterations 

In this section, we review the basic method for tensor decomposition based on power iterations for a special
class of tensors, viz., symmetric orthogonal tensors. Subsequently, in Sections 3.2 and 3.3, we modify this
method to learn the mixed membership model from graph moments, described in the previous section. For
details on the tensor power method described below, refer to [3, 28].

Recall that a third-order tensor $T$ is a three-dimensional array and we use $T_{p,q,r}$ to denote the $(p,q,r)$-th
entry of the tensor $T$. The standard symbol $\otimes$ is used to denote the Kronecker product, and $(u \otimes v \otimes w)$ is a
rank one tensor. The decomposition of a tensor into its rank one components is called the CP decomposition.

Multi-linear maps: We can view a tensor $T \in \mathbb{R}^{d \times d \times d}$ as a multilinear map in the following sense: for
a set of matrices $\{V_i \in \mathbb{R}^{d \times m_i} : i \in [3]\}$, the $(i_1, i_2, i_3)$-th entry in the three-way array representation of
$T(V_1, V_2, V_3) \in \mathbb{R}^{m_1 \times m_2 \times m_3}$ is

$$[T(V_1, V_2, V_3)]_{i_1, i_2, i_3} := \sum_{j_1, j_2, j_3 \in [d]} T_{j_1, j_2, j_3}\, [V_1]_{j_1, i_1}\, [V_2]_{j_2, i_2}\, [V_3]_{j_3, i_3}.$$

The term multilinear map arises from the fact that the above map is linear in each of the coordinates, e.g.
if we replace $V_1$ by $a V_1 + b W_1$ in the above equation, where $W_1$ is a matrix of appropriate dimensions, and
$a, b$ are any scalars, the output is a linear combination of the outputs under $V_1$ and $W_1$ respectively. We will
use the above notion of multi-linear transforms to describe various tensor operations. For instance, $T(I, I, v)$
yields a matrix, $T(I, v, v)$, a vector, and $T(v, v, v)$, a scalar.

Symmetric tensors and orthogonal decomposition: A special class of tensors are the symmetric
tensors $T \in \mathbb{R}^{d \times d \times d}$ which are invariant to permutation of the array indices. Symmetric tensors have CP
decomposition of the form

$$T = \sum_{i \in [r]} \lambda_i\, v_i \otimes v_i \otimes v_i = \sum_{i \in [r]} \lambda_i\, v_i^{\otimes 3}, \tag{16}$$

where $r$ denotes the tensor CP rank and we use the notation $v_i^{\otimes 3} := v_i \otimes v_i \otimes v_i$. It is convenient to first
analyze methods for decomposition of symmetric tensors and we then extend them to the general case of
asymmetric tensors.
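
The multilinear map defined above translates directly into a single `numpy.einsum` call. The sketch below is illustrative only; the function name `multilinear` and the random test tensor are assumptions, not part of the original text.

```python
import numpy as np

def multilinear(T, V1, V2, V3):
    """[T(V1,V2,V3)]_{i1,i2,i3} = sum_{j1,j2,j3} T_{j1,j2,j3} V1[j1,i1] V2[j2,i2] V3[j3,i3]."""
    return np.einsum("abc,ai,bj,ck->ijk", T, V1, V2, V3)

rng = np.random.default_rng(0)
d = 4
T = rng.standard_normal((d, d, d))
v = rng.standard_normal(d)
I = np.eye(d)
col = v[:, None]                                   # v as a d x 1 matrix

matrix = multilinear(T, I, I, col)[:, :, 0]        # T(I, I, v): a d x d matrix
vector = multilinear(T, I, col, col)[:, 0, 0]      # T(I, v, v): a vector of length d
scalar = multilinear(T, col, col, col)[0, 0, 0]    # T(v, v, v): a scalar

# Linearity in the first coordinate: T(aV + bW, ., .) = a T(V, ., .) + b T(W, ., .)
V, W = rng.standard_normal((d, 2)), rng.standard_normal((d, 2))
lhs = multilinear(T, 2.0 * V + 3.0 * W, I, I)
rhs = 2.0 * multilinear(T, V, I, I) + 3.0 * multilinear(T, W, I, I)
```

Treating a vector as a $d \times 1$ matrix and dropping the singleton axes recovers the three special cases mentioned in the text.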

Further, a sub-class of symmetric tensors are those which possess a decomposition into orthogonal
components, i.e. the vectors $v_i \in \mathbb{R}^d$ are orthogonal to one another in the above decomposition in (16) (without
loss of generality, we assume that vectors $\{v_i\}$ are orthonormal in this case). An orthogonal decomposition
implies that the tensor rank $r \leq d$ and there are tractable methods for recovering the rank-one components
in this setting. We limit ourselves to this setting in this paper.



Tensor eigen analysis: For symmetric tensors $T$ possessing an orthogonal decomposition of the form in
(16), each pair $(\lambda_i, v_i)$, for $i \in [r]$, can be interpreted as an eigen-pair for the tensor $T$, since

$$T(I, v_i, v_i) = \sum_{j \in [r]} \lambda_j \langle v_i, v_j \rangle^2 v_j = \lambda_i v_i, \quad \forall i \in [r],$$

due to the fact that $\langle v_i, v_j \rangle = \delta_{ij}$. Thus, the vectors $\{v_i\}_{i \in [r]}$ can be interpreted as fixed points of the map

$$v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}, \tag{17}$$

where $\| \cdot \|$ denotes the spectral norm (and $\|T(I, v, v)\|$ is a vector norm), and is used to normalize the vector
$v$ in (17).



Basic tensor power iteration method: A straightforward approach to computing the orthogonal
decomposition of a symmetric tensor is to iterate according to the fixed-point map in (17) with an arbitrary
initialization vector. This is referred to as the tensor power iteration method. Additionally, it is known that
the vectors $\{v_i\}_{i \in [r]}$ are the only stable fixed points of the map in (17). In other words, the set of initialization
vectors which converge to vectors other than $\{v_i\}_{i \in [r]}$ are of measure zero. This ensures that we obtain the
correct set of vectors through power iterations and that no spurious answers are obtained. See [4, Thm.
4.1] for details. Moreover, after an approximately fixed point is obtained (after many power iterations), the
estimated eigen-pair can be subtracted out (i.e., deflated) and subsequent vectors can be similarly obtained
through power iterations. Thus, we can obtain all the stable eigen-pairs $\{\lambda_i, v_i\}_{i \in [r]}$, which are the components
of the orthogonal tensor decomposition. The method needs to be suitably modified when the tensor $T$ is
perturbed (e.g. as in the case when empirical moments are used) and we discuss it in Section 3.3.
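
The basic method just described (iterate the map in (17), then deflate) can be sketched as follows for an exactly orthogonal symmetric tensor. This is a minimal illustration, not the robust variant of Section 3.3; the function names, the random initialization and the iteration count are assumptions chosen for the example.

```python
import numpy as np

def power_iteration(T, v0, num_iters=100):
    """Iterate v <- T(I, v, v) / ||T(I, v, v)|| and return the eigen-pair estimate."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_iters):
        v = np.einsum("abc,b,c->a", T, v, v)
        v /= np.linalg.norm(v)
    lam = float(np.einsum("abc,a,b,c->", T, v, v, v))  # eigenvalue T(v, v, v)
    return lam, v

def decompose(T, rank, rng, num_iters=100):
    """Recover `rank` eigen-pairs one at a time, deflating after each one."""
    T = T.copy()
    pairs = []
    for _ in range(rank):
        lam, v = power_iteration(T, rng.standard_normal(T.shape[0]), num_iters)
        pairs.append((lam, v))
        T -= lam * np.einsum("a,b,c->abc", v, v, v)    # subtract the found component
    return pairs

# Build a symmetric tensor with a known orthogonal decomposition.
rng = np.random.default_rng(0)
d, r = 5, 3
V, _ = np.linalg.qr(rng.standard_normal((d, r)))       # orthonormal columns v_i
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum("i,ai,bi,ci->abc", lams, V, V, V)

pairs = decompose(T, r, rng)
```

The order in which the pairs are found depends on the random initialization, so the recovered pairs should be matched to the true ones up to a permutation.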



3.2 Learning Mixed Membership Models Under Exact Moments 

We first describe the learning approach when exact moments are available. In Section 3.3, we suitably modify 
the approach to handle perturbations, which are introduced when only empirical moments are available. 

We now employ the tensor power method described above to obtain a CP decomposition of the graph
moment tensor $T^{\alpha_0}$ in (12). We first describe a "symmetrization" procedure to convert the graph moment
tensor $T^{\alpha_0}$ to a symmetric orthogonal tensor through a multi-linear transformation of $T^{\alpha_0}$. We then employ
the power method to obtain a symmetric orthogonal decomposition. Finally, the original CP decomposition
is obtained by reversing the multi-linear transform of the symmetrization procedure. This yields a guaranteed
method for obtaining the decomposition of graph moment tensor $T^{\alpha_0}$ under exact moments. We note that
this symmetrization approach has been earlier employed in other contexts, e.g. for learning hidden Markov
models [4, Sec. 3.3].



Reduction of the graph-moment tensor to symmetric orthogonal form (Whitening): Recall
from Proposition 2.2 that the modified 3-star count tensor $T^{\alpha_0}$ has a CP decomposition as

$$\mathbb{E}[T^{\alpha_0} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i=1}^{k} \hat{\alpha}_i\, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i.$$

We now describe a symmetrization procedure to convert $T^{\alpha_0}$ to a symmetric orthogonal tensor through a
multi-linear transformation using the modified adjacency matrix $G^{\alpha_0}$, defined in (11). Consider the singular
value decomposition (SVD) of the modified adjacency matrix $G^{\alpha_0}$ under exact moments:

$$|X|^{-1/2}\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] = U_A D_A V_A^\top.$$

Define $W_A := U_A D_A^{-1}$, and similarly define $W_B$ and $W_C$ using the corresponding matrices $G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$,
respectively. Now define

$$R_{A,B} := \frac{1}{|X|}\, W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] \cdot \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi]\, W_A, \qquad \tilde{W}_B := W_B R_{A,B}, \tag{18}$$

and similarly define $\tilde{W}_C$. We establish that a multilinear transformation (as defined in (3.1)) of the
graph-moment tensor $T^{\alpha_0}$ using matrices $W_A$, $\tilde{W}_B$, and $\tilde{W}_C$ results in a symmetric orthogonal form.

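Lemma 3.1 (Orthogonal Symmetric Tensor). Assume that the matrices $F_A, F_B, F_C$ and $\Pi_X$ have rank $k$,
where $k$ is the number of communities. We have an orthogonal symmetric tensor form for the modified 3-star
count tensor $T^{\alpha_0}$ in (12) under a multilinear transformation using matrices $W_A$, $\tilde{W}_B$, and $\tilde{W}_C$:

$$\mathbb{E}[T^{\alpha_0}(W_A, \tilde{W}_B, \tilde{W}_C) \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i \in [k]} \lambda_i\, (\Phi)_i^{\otimes 3} \in \mathbb{R}^{k \times k \times k}, \tag{19}$$

where $\lambda_i := \hat{\alpha}_i^{-0.5}$ and $\Phi \in \mathbb{R}^{k \times k}$ is an orthogonal matrix, given by

$$\Phi := W_A^\top F_A\, \mathrm{Diag}(\hat{\alpha}^{0.5}). \tag{20}$$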



Remark 1: Note that the matrix $W_A$ orthogonalizes $F_A$ under exact moments, and is referred to as a
whitening matrix. Similarly, the matrices $\tilde{W}_B = W_B R_{A,B}$ and $\tilde{W}_C = W_C R_{A,C}$ consist of whitening matrices
$W_B$ and $W_C$, and in addition, the matrices $R_{A,B}$ and $R_{A,C}$ serve to symmetrize the tensor. We can interpret
$\{\lambda_i, (\Phi)_i\}_{i \in [k]}$ as the stable eigen-pairs of the transformed tensor (henceforth, referred to as the whitened and
symmetrized tensor).

Remark 2: The full rank assumption on matrix $F_A = \Pi_A^\top P^\top \in \mathbb{R}^{|A| \times k}$ implies that $|A| \geq k$, and
similarly $|B|, |C|, |X| \geq k$. Moreover, we require the community connectivity matrix $P \in \mathbb{R}^{k \times k}$ be of
full rank (which is a natural non-degeneracy condition). In this case, we can reduce the graph-moment
tensor $T^{\alpha_0}$ to a rank-$k$ orthogonal symmetric tensor, which has a unique decomposition. This implies that
the mixed membership model is identifiable using 3-star and edge count moments, when the network size
$n = |A| + |B| + |C| + |X| \geq 4k$, matrix $P$ is full rank and the community membership matrices $\Pi_A, \Pi_B, \Pi_C, \Pi_X$
each have rank $k$. On the other hand, when only empirical moments are available, roughly, we require the
network size $n = \Omega(k^2 (\alpha_0 + 1)^2)$ (where $\alpha_0 := \sum_i \alpha_i$ is related to the extent of overlap between the
communities) to provide guaranteed learning of the community membership and model parameters. See Section 4
for a detailed sample analysis.

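Proof: Recall that the modified adjacency matrix $G^{\alpha_0}$ satisfies

$$\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi_A, \Pi_X] = F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, \Psi_X,$$

$$\Psi_X := \mathrm{Diag}(\hat{\alpha}^{-1/2}) \left( \sqrt{\alpha_0 + 1}\, \Pi_X - \left(\sqrt{\alpha_0 + 1} - 1\right) \left( \frac{1}{|X|} \sum_{i \in X} \pi_i \right) \vec{1}^\top \right).$$

From the definition of $\Psi_X$ above, we see that it has rank $k$ when $\Pi_X$ has rank $k$. Using Sylvester's rank
inequality, we have that the rank of $F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, \Psi_X$ is at least $2k - k = k$. This implies that the whitening
matrix $W_A$ also has rank $k$. Notice that

$$|X|^{-1}\, W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] \cdot \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi]\, W_A = D_A^{-1} U_A^\top U_A D_A^2 U_A^\top U_A D_A^{-1} = I \in \mathbb{R}^{k \times k},$$

or in other words, $|X|^{-1} M M^\top = I$, where $M := W_A^\top F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, \Psi_X$. We now have that

$$\begin{aligned}
I = |X|^{-1}\, \mathbb{E}_{\Pi_X}[M M^\top] &= |X|^{-1}\, W_A^\top F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, \mathbb{E}[\Psi_X \Psi_X^\top]\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, F_A^\top W_A \\
&= W_A^\top F_A\, \mathrm{Diag}(\hat{\alpha})\, F_A^\top W_A,
\end{aligned}$$

since $|X|^{-1}\, \mathbb{E}_{\Pi_X}[\Psi_X \Psi_X^\top] = I$ from (15), and we use the fact that the sets $A$ and $X$ do not overlap. Thus,
$W_A$ whitens $F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})$ under exact moments (upon taking expectation over $\Pi_X$) and the columns of
$W_A^\top F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})$ are orthonormal. Now note from the definition of $\tilde{W}_B$ that

$$\tilde{W}_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] = W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi],$$

since $W_B$ satisfies

$$|X|^{-1}\, W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] \cdot \mathbb{E}[(G^{\alpha_0}_{X,B}) \mid \Pi]\, W_B = I,$$

and a similar result holds for $\tilde{W}_C$. The final result in (19) follows by taking expectation of tensor $T^{\alpha_0}$ over
$\Pi_X$. □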
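
The whitening step of Lemma 3.1 can be illustrated numerically under a simplifying assumption: the sketch below uses a single factor matrix `F` in all three modes (so no symmetrization matrices $R_{A,B}$, $R_{A,C}$ are needed) and takes the expectation over $\Pi_X$ as already applied, so that by (13) and (15) the left singular factors $U_A, D_A$ are those of $F_A\,\mathrm{Diag}(\hat{\alpha}^{1/2})$. The numerical values of `F` and `alpha_hat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n_a = 3, 8
alpha_hat = np.array([0.5, 0.3, 0.2])             # normalized Dirichlet vector (assumed)
F = rng.uniform(0.1, 0.9, size=(n_a, k))          # stand-in for F_A (assumed values)

# Thin SVD of F Diag(alpha_hat^{1/2}) gives U and D; the whitening matrix is W = U D^{-1}.
S = F * np.sqrt(alpha_hat)                        # scales column i by alpha_hat_i^{1/2}
U, D, _ = np.linalg.svd(S, full_matrices=False)
W = U / D                                         # divides column i of U by D_i

Phi = W.T @ S                                     # equation (20)

# Moment tensor with CP form sum_i alpha_hat_i F_i (x) F_i (x) F_i, then whitened.
T = np.einsum("i,ai,bi,ci->abc", alpha_hat, F, F, F)
T_white = np.einsum("abc,ai,bj,ck->ijk", T, W, W, W)

# Right-hand side of (19): eigenvalues alpha_hat_i^{-1/2}, eigenvectors the columns of Phi.
expected = np.einsum("i,ai,bi,ci->abc", alpha_hat ** -0.5, Phi, Phi, Phi)
```

In this sketch `Phi` comes out orthogonal and `T_white` agrees with `expected`, which is the content of (19) and (20) in the symmetric special case.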



Overview of the learning approach under exact moments: With the above result in place, we
are now ready to describe the high-level approach for learning the mixed membership model under exact
moments. First, symmetrize the graph-moment tensor $T^{\alpha_0}$ as described above and then apply the tensor
power method described in the previous section. This enables us to obtain the vector of eigenvalues
$\lambda := \hat{\alpha}^{-1/2}$ and the matrix of eigenvectors $\Phi = W_A^\top F_A\, \mathrm{Diag}(\hat{\alpha}^{0.5})$ using tensor power iterations. We can then
recover the community membership vectors of set $A^c$ (i.e., nodes not in set $A$) under exact moments as

$$\Pi_{A^c} \leftarrow \mathrm{Diag}(\lambda)^{-1}\, \Phi^\top W_A^\top\, \mathbb{E}[G_{A^c,A}^\top \mid \Pi],$$

since $\mathbb{E}[G_{A^c,A}^\top \mid \Pi] = F_A \Pi_{A^c}$ (since $A$ and $A^c$ do not overlap) and $\mathrm{Diag}(\lambda)^{-1} \Phi^\top W_A^\top = \mathrm{Diag}(\hat{\alpha})\, F_A^\top W_A W_A^\top$
under exact moments. In order to recover the community membership vectors of set $A$, viz., $\Pi_A$, we can
reverse the direction of the 3-star counts, i.e., consider the 3-stars from set $A$ to $X, B, C$ and obtain $\Pi_A$ in a
similar manner. Once all the community membership vectors $\Pi$ are obtained, we can obtain the community
connectivity matrix $P$, using the relationship $\Pi^\top P\, \Pi = \mathbb{E}[G \mid \Pi]$ and noting that we assume $\Pi$ to be of rank
$k$. Thus, we are able to learn the community membership vectors $\Pi$ and the model parameters $\hat{\alpha}$ and $P$ of
the mixed membership model using edge counts and the 3-star count tensor. We now describe modifications
to this approach to handle empirical moments.
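
The recovery formulas above can be checked on synthetic exact moments. The sketch below is illustrative only: the parameter values are hypothetical, $\Pi$ is stored as a $k \times n$ matrix whose columns are membership vectors (consistent with $F := \Pi^\top P^\top$), and the exact $\Phi$ and $\lambda$ are formed directly instead of being obtained from power iterations, which would return them only up to a permutation.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n_a, n_rest = 3, 10, 6
alpha = np.array([0.4, 0.3, 0.3])                 # Dirichlet parameters (assumed)
alpha_hat = alpha / alpha.sum()
P = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.1],
              [0.2, 0.1, 0.9]])                   # full-rank connectivity matrix (assumed)

Pi_A = rng.dirichlet(alpha, size=n_a).T           # k x |A|
Pi_rest = rng.dirichlet(alpha, size=n_rest).T     # k x |A^c|
F_A = Pi_A.T @ P.T                                # F_A = Pi_A^T P^T

# Exact-moment whitening quantities, as in Lemma 3.1.
S = F_A * np.sqrt(alpha_hat)
U, D, _ = np.linalg.svd(S, full_matrices=False)
W_A = U / D
Phi = W_A.T @ S                                   # equation (20)
lam = alpha_hat ** -0.5                           # eigenvalues of the whitened tensor

# E[G_{A^c,A}^T | Pi] = F_A Pi_{A^c}; apply the recovery formula.
expected_G = F_A @ Pi_rest
Pi_recovered = np.diag(1.0 / lam) @ Phi.T @ W_A.T @ expected_G

# Recover P from Pi^T P Pi = E[G | Pi] using the pseudo-inverse of Pi.
Pi = np.hstack([Pi_A, Pi_rest])
Pi_pinv = np.linalg.pinv(Pi)
P_recovered = Pi_pinv.T @ (Pi.T @ P @ Pi) @ Pi_pinv
```

Both `Pi_recovered` and `P_recovered` match the generating quantities up to floating-point error, as expected when exact moments are available.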

3.3 Learning Algorithm Under Empirical Moments 

In the previous section, we explored a tensor-based approach for learning mixed membership model under 
exact moments. However, in practice, we only have samples (i.e. the observed network), and the method 
needs to be robust to perturbations when empirical moments are employed. 

3.3.1 Pre-processing steps 

Partitioning: In the previous section, we partitioned the nodes into four sets $A, B, C, X$ for learning
under exact moments. However, we require more partitions under empirical moments to avoid statistical
dependency issues and obtain stronger reconstruction guarantees. We now divide the network into five
non-overlapping sets $A, B, C, X, Y$. The set $X$ is employed to compute whitening matrices $W_A$, $W_B$ and $W_C$,
described in detail subsequently, the set $Y$ is employed to compute the 3-star count tensor $T^{\alpha_0}$ and sets
$A, B, C$ contain the leaves of the 3-stars under consideration. The roles of the sets can be interchanged to
obtain the community membership vectors of all the sets.
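
Algorithm 1 $\{\hat{\Pi}, \hat{P}, \hat{\alpha}\} \leftarrow$ LearnMixedMembership($G$, $k$, $\alpha_0$, $N$, $\tau$)

Input: Adjacency matrix $G \in \mathbb{R}^{n \times n}$, $k$ is the number of communities, $\alpha_0 := \sum_i \alpha_i$, where $\alpha$ is the Dirichlet
parameter vector, $N$ is the number of iterations for the tensor power method, and $\tau$ is used for thresholding
the estimated community membership vectors. Let $A^c := [n] \setminus A$ denote the set of nodes not in $A$.
Output: Estimates of the community membership vectors $\Pi \in \mathbb{R}^{n \times k}$, community connectivity matrix
$P \in [0, 1]^{k \times k}$, and the normalized Dirichlet parameter vector $\hat{\alpha}$.
Partition the graph $G$ into 5 parts $X, Y, A, B, C$.
Compute moments $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y \to \{A,B,C\}}$ using (11) and (12).
$\{\hat{\Pi}_{A^c}, \hat{\alpha}\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y \to \{A,B,C\}}$, $G$, $N$, $\tau$).
Interchange roles^° of $Y$ and $A$ to obtain $\hat{\Pi}_{Y^c}$.
Define $\hat{Q} := \frac{\alpha_0 + 1}{n} \left( \hat{\Pi}\, \mathrm{Diag}(\hat{\alpha}^{-1}) - \frac{\alpha_0}{\alpha_0 + 1} \vec{1} \vec{1}^\top \right)$.
Estimate $\hat{P} \leftarrow \hat{Q} G \hat{Q}^\top$.
Return $\hat{\Pi}, \hat{P}, \hat{\alpha}$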



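Procedure 1 $\{\hat{\Pi}_{A^c}, \hat{\alpha}\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y \to \{A,B,C\}}$, $G$, $N$, $\tau$)

Input: Require modified adjacency submatrices $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, 3-star count tensor $T^{\alpha_0}_{Y \to \{A,B,C\}}$,
adjacency matrix $G$, number of iterations $N$ for the tensor power method and threshold $\tau$ for thresholding
estimated community membership vectors. Let Thres($A$, $\tau$) denote the element-wise thresholding operation
using threshold $\tau$, i.e., Thres$(A, \tau)_{i,j} = A_{i,j}$ if $A_{i,j} > \tau$ and 0 otherwise. Let $e_i$ denote basis vector along
coordinate $i$.

Output: Estimates of $\Pi_{A^c}$ and $\hat{\alpha}$.

Compute rank-$k$ SVD: $|X|^{-1/2} (G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} = U_A D_A V_A^\top$ and compute whitening matrices $W_A := U_A D_A^{-1}$.
Similarly, compute $W_B, W_C$ and $R_{AB}, R_{AC}$ using (21).

Compute whitened and symmetrized tensor $T \leftarrow T^{\alpha_0}_{Y \to \{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC})$.

$\{\lambda, \Phi\} \leftarrow$ TensorEigen($T$, $\{W_A^\top G_{i,A}^\top\}_{i \notin A}$, $N$). {$\Phi$ is a $k \times k$ matrix with each column being an estimated
eigenvector and $\lambda$ is the vector of estimated eigenvalues.}

$\hat{\Pi}_{A^c} \leftarrow$ Thres($\mathrm{Diag}(\lambda)^{-1} \Phi^\top W_A^\top G_{A^c,A}^\top$, $\tau$) and $\hat{\alpha}_i \leftarrow \lambda_i^{-2}$ for $i \in [k]$.

Return $\hat{\Pi}_{A^c}$ and $\hat{\alpha}$.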

Whitening: The whitening procedure is along the same lines as described in the previous section, except
that now empirical moments are used. Specifically, consider the rank-$k$ singular value decomposition (SVD)
of the modified adjacency matrix $G^{\alpha_0}$ defined in (11),

$$|X|^{-1/2} (G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} = U_A D_A V_A^\top.$$

Define $W_A := U_A D_A^{-1}$, and similarly define $W_B$ and $W_C$ using the corresponding matrices $G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$
respectively. Now define

$$R_{A,B} := \frac{1}{|X|}\, W_B^\top (G^{\alpha_0}_{X,B})^\top_{k\text{-svd}} \cdot (G^{\alpha_0}_{X,A})_{k\text{-svd}}\, W_A, \tag{21}$$

and similarly define $R_{AC}$. The whitened and symmetrized graph-moment tensor is now computed as

$$T^{\alpha_0}_{Y \to \{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC}),$$

where $T^{\alpha_0}$ is given by (12) and the multi-linear transformation of a tensor is defined in (3.1).

3.3.2 Modifications to the tensor power method 

Recall that under exact moments, the stable eigen-pairs of a symmetric orthogonal tensor can be computed 
in a straightforward manner through the basic power iteration method in (17), along with the deflation 
procedure. However, this is not sufficient to get good reconstruction guarantees under empirical moments. 
We now propose a robust tensor method, detailed in Procedure 2. The main modifications involve: (i) 
efficient initialization and (ii) adaptive deflation, which are detailed below. 

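Procedure 2 $\{\lambda, \Phi\} \leftarrow$ TensorEigen($T$, $\{v_i\}_{i \in [L]}$, $N$)

Input: Tensor $T \in \mathbb{R}^{k \times k \times k}$, set of $L$ initialization vectors $\{v_i\}_{i \in [L]}$, number of iterations $N$.

Output: the estimated eigenvalue/eigenvector pairs $\{\lambda, \Phi\}$, where $\lambda$ is the vector of eigenvalues and $\Phi$ is
the matrix of eigenvectors.

for $i = 1$ to $k$ do
    for $\tau = 1$ to $L$ do
        $\theta_0^{(\tau)} \leftarrow v_\tau$.
        for $t = 1$ to $N$ do
            $\tilde{T} \leftarrow T$.
            for $j = 1$ to $i - 1$ (when $i > 1$) do
                if $|\lambda_j \langle \theta_{t-1}^{(\tau)}, \phi_j \rangle| > \xi$ then
                    $\tilde{T} \leftarrow \tilde{T} - \lambda_j \phi_j^{\otimes 3}$.
                end if
            end for
            Compute power iteration update $\theta_t^{(\tau)} := \tilde{T}(I, \theta_{t-1}^{(\tau)}, \theta_{t-1}^{(\tau)}) / \|\tilde{T}(I, \theta_{t-1}^{(\tau)}, \theta_{t-1}^{(\tau)})\|$
        end for
    end for
    Let $\tau^* := \arg\max_{\tau \in [L]} \{\tilde{T}(\theta_N^{(\tau)}, \theta_N^{(\tau)}, \theta_N^{(\tau)})\}$.
    Do $N$ power iteration updates starting from $\theta_N^{(\tau^*)}$ to obtain eigenvector estimate $\phi_i$, and set
    $\lambda_i := \tilde{T}(\phi_i, \phi_i, \phi_i)$.
end for

return the estimated eigenvalue/eigenvectors $(\lambda, \Phi)$.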
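
Procedure 2 can be sketched in Python as below. This is a hedged illustration rather than a reference implementation: the function name `tensor_eigen`, the random initialization vectors (the text recommends whitened neighborhood vectors instead), the value of the deflation threshold `xi`, and the choice to test the threshold against the most recent iterate are all assumptions made for the example.

```python
import numpy as np

def tensor_eigen(T, inits, num_iters, k, xi):
    """Power iterations with adaptive deflation, following the structure of Procedure 2."""
    lams, phis = [], []

    def deflated(theta):
        # Subtract only those found eigen-pairs that "compete" with the current iterate.
        T_def = T.copy()
        for lam_j, phi_j in zip(lams, phis):
            if abs(lam_j * (theta @ phi_j)) > xi:
                T_def -= lam_j * np.einsum("a,b,c->abc", phi_j, phi_j, phi_j)
        return T_def

    def run(theta, steps):
        theta = theta / np.linalg.norm(theta)
        for _ in range(steps):
            theta = np.einsum("abc,b,c->a", deflated(theta), theta, theta)
            theta /= np.linalg.norm(theta)
        return theta

    def value(theta):
        return float(np.einsum("abc,a,b,c->", deflated(theta), theta, theta, theta))

    for _ in range(k):
        candidates = [run(v, num_iters) for v in inits]   # one run per initialization
        best = max(candidates, key=value)                 # largest T~(theta, theta, theta)
        phi = run(best, num_iters)                        # N further updates from the best
        lams.append(value(phi))
        phis.append(phi)
    return np.array(lams), np.column_stack(phis)

# Exact whitened tensor with eigenvalues alpha_hat_i^{-1/2} and orthonormal eigenvectors.
rng = np.random.default_rng(4)
k = 3
lam_true = np.array([0.5, 0.3, 0.2]) ** -0.5
V, _ = np.linalg.qr(rng.standard_normal((k, k)))
T = np.einsum("i,ai,bi,ci->abc", lam_true, V, V, V)

inits = [rng.standard_normal(k) for _ in range(10)]
lam_est, Phi_est = tensor_eigen(T, inits, num_iters=50, k=k, xi=0.5)
```

On this unperturbed tensor the procedure returns each eigen-pair once, in an order that depends on the initialization vectors; the robustness benefits discussed in the text only become relevant when `T` is an empirical estimate.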

Efficient Initialization: Recall that the basic tensor power method incorporates generic initialization
vectors and this procedure recovers all the stable eigenvectors correctly (except for initialization vectors over
a set of measure zero). However, under empirical moments, we have a perturbed tensor, and here, it is
advantageous to instead employ specific initialization vectors. For instance, to obtain one of the eigenvectors
$(\Phi)_i$, it is advantageous to initialize with a vector in the neighborhood of $(\Phi)_i$. This not only reduces the
number of power iterations required to converge (approximately), but more importantly, this makes the
power method more robust to perturbations. See Theorem B.1 in Appendix B.1 for a detailed analysis
quantifying the relationship between initialization vectors, tensor perturbation and the resulting guarantees
for recovery of the tensor eigenvectors.

For a mixed membership model in the sparse regime, recall that the community membership vectors $\Pi$
are sparse (with high probability). Under this regime of the model, we note that the whitened neighborhood
vectors contain good initializers for the power iterations. Specifically, in Procedure 2, we initialize with the
whitened neighborhood vectors $W_A^\top G_{i,A}^\top$, for $i \notin A$. The intuition behind this is as follows: for a suitable
choice of parameters (such as the scaling of network size $n$ with respect to the number of communities $k$), we
expect neighborhood vectors $G_{i,A}^\top$ to concentrate around their mean values, viz., $F_A \pi_i$. Since $\pi_i$ is sparse
(w.h.p) for the model regime under consideration, this implies that there exist vectors $W_A^\top F_A \pi_i$, for $i \in A^c$,
which concentrate (w.h.p) along only a few eigen-directions of the whitened tensor, and hence, serve as
an effective initializer.



Adaptive Deflation: Recall that in the basic power iteration procedure, we can obtain the eigen-pairs
one after another through simple deflation: subtracting the current estimates of eigen-pairs and running the
power iterations again. However, it turns out that we can establish better robustness guarantees when we
adaptively deflate the tensor in each power iteration. In Procedure 2, among the estimated eigen-pairs, we
only deflate those which "compete" with the current estimate of the power iteration. In other words, if the
vector in the current iteration has a significant projection along the direction of an estimated eigen-pair,
then the eigen-pair is deflated; otherwise it is retained and not deflated. This allows us to carefully control
the error build-up for each estimated eigenpair in our analysis. See Theorem B.1 in Appendix B.1 for details.

In addition, we note that stabilization, as proposed in [28] for general tensor eigen-decomposition (as
opposed to orthogonal decomposition in this paper), can be effective in improving convergence, especially
on real data, and we defer its detailed analysis to future work.



3.3.3 Reconstruction after tensor power method 

Recall that previously in Section 3.2, when exact moments are available, estimating the community
membership vectors $\Pi$ is straightforward, once we recover all the stable tensor eigen-pairs. However, in case of
empirical moments, we can obtain better guarantees with the following modification: the estimated
community membership vectors $\hat{\Pi}$ are further subject to thresholding so that the weak values are set to zero.
Since we are limiting ourselves to the regime of the mixed membership model, where the community vectors
$\Pi$ are sparse (w.h.p), this modification strengthens our reconstruction guarantees. This thresholding step is
incorporated in Algorithm 1.
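
The thresholding step is a one-line element-wise operation. The sketch below mirrors the Thres operation of Procedure 1; the function name `thres`, the strict inequality, and the example values are assumptions for illustration.

```python
import numpy as np

def thres(A, tau):
    """Element-wise thresholding: keep entries larger than tau, set the rest to zero."""
    A = np.asarray(A, dtype=float)
    return np.where(A > tau, A, 0.0)

# Hypothetical noisy membership estimates (rows: communities, columns: nodes).
Pi_hat = np.array([[0.62, 0.03],
                   [0.35, 0.90],
                   [0.03, 0.07]])
Pi_sparse = thres(Pi_hat, tau=0.05)
```

Entries at or below the threshold are zeroed while the remaining values are left unchanged, so the result is sparse but not renormalized.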

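Moreover, recall that under exact moments, estimating the community connectivity matrix $P$ is
straightforward, once we recover the community membership vectors, since $P \leftarrow (\Pi^\dagger)^\top \mathbb{E}[G \mid \Pi]\, \Pi^\dagger$. However, when
empirical moments are available, we are able to establish better reconstruction guarantees through a different
method, outlined in Algorithm 1. We define

$$\hat{Q} := \frac{\alpha_0 + 1}{n} \left( \hat{\Pi}\, \mathrm{Diag}(\hat{\alpha}^{-1}) - \frac{\alpha_0}{\alpha_0 + 1} \vec{1} \vec{1}^\top \right)$$

based on estimates $\hat{\Pi}$ and $\hat{\alpha}$, and the matrix $\hat{P}$ is obtained as $\hat{P} \leftarrow \hat{Q} G \hat{Q}^\top$. Note that under exact moments,
we have $\hat{\Pi} = \Pi$, i.e., community vectors are recovered perfectly, and

$$\mathbb{E}_{\Pi}[Q \Pi^\top] = \frac{\alpha_0 + 1}{n} \left( \mathbb{E}[\Pi\, \mathrm{Diag}(\hat{\alpha}^{-1})\, \Pi^\top] - \frac{\alpha_0}{\alpha_0 + 1} \vec{1} \vec{1}^\top \mathbb{E}[\Pi^\top] \right) = I,$$

where $Q$ is the counterpart of $\hat{Q}$ under exact moments. Thus, we see that in the case of exact moments, the
community connectivity matrix $P$ is recovered exactly. We now carry out careful sample analysis for the
above method and the results are summarized in the next section.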
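
The role of $Q$ can be illustrated with a small Monte Carlo experiment. The sketch assumes that $\Pi$ is stored as a $k \times n$ matrix with one membership vector per column (so the scaling by $\hat{\alpha}^{-1}$ acts on rows), that the true $\Pi$ and $\hat{\alpha}$ are plugged in, and that $G$ is replaced by its conditional mean $\Pi^\top P\,\Pi$; the function name `q_matrix` and all numerical values are hypothetical.

```python
import numpy as np

def q_matrix(Pi, alpha_hat, alpha0):
    """Q = ((alpha0 + 1)/n) * Diag(alpha_hat^{-1}) Pi - (alpha0/n) * 1 1^T, for k x n Pi."""
    k, n = Pi.shape
    return ((alpha0 + 1.0) / n) * (Pi / alpha_hat[:, None]) - (alpha0 / n) * np.ones((k, n))

rng = np.random.default_rng(3)
alpha = np.array([0.3, 0.3, 0.4])
alpha0 = alpha.sum()
alpha_hat = alpha / alpha0
n = 500_000
Pi = rng.dirichlet(alpha, size=n).T               # k x n, columns drawn from Dir(alpha)

P = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.1],
              [0.2, 0.1, 0.9]])

Q = q_matrix(Pi, alpha_hat, alpha0)
M = Q @ Pi.T                                      # approximately the k x k identity
# Q E[G|Pi] Q^T with E[G|Pi] = Pi^T P Pi, computed without forming the n x n matrix.
P_est = M @ P @ M.T
```

Because `M` concentrates around the identity as `n` grows, `P_est` approaches `P`; with an observed adjacency matrix in place of its mean there is additional sampling noise, which is what the analysis in the next section quantifies.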

Improved support recovery estimates in homophilic models: A sub-class of community models
are those satisfying homophily. As discussed in Section 1, homophily or the tendency to form edges within
the members of the same community has been posited as an important factor in community formation,
especially in social settings. Many of the existing learning algorithms, e.g. [44], require this assumption to
provide guarantees in the stochastic block model setting.

Specifically, we require models with community connectivity matrix $P \in [0, 1]^{k \times k}$ satisfying $P(i,i) >
P(i,j)$ for all $i \neq j$. For such models, we can obtain improved estimates by averaging, as detailed in
Procedure 3. Specifically, consider nodes in set $C$ and edges going from $C$ to nodes in $B$. First, consider
the special case of the stochastic block model: for each node $c \in C$, compute the number of neighbors in $B$
belonging to each community (as given by the estimate $\hat{\Pi}$ from Algorithm 1), and declare the community
with the maximum number of such neighbors as the community of node $c$. Intuitively, this provides a better
estimate for $\Pi_C$ since we average over the edges in $B$. This method has been used before in the context
of spectral clustering [32]. The same idea can be extended to the general mixed membership (homophilic)
models: declare communities to be significant if they exceed a certain threshold, as evaluated by the average
number of edges to each community. The details are provided in Procedure 3. In the next section, we
establish that in a certain regime of parameters, this procedure can lead to zero-error support recovery of
significant community memberships of the nodes and also rule out communities where a node does not have
a strong presence.
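
For the stochastic block model special case, the neighbor-counting rule described above can be sketched as follows. The function name `assign_by_neighbor_votes`, the hard labels for the nodes in $B$, and the toy adjacency block are assumptions for illustration; Procedure 3 itself works with the averaged quantity $F_C$.

```python
import numpy as np

def assign_by_neighbor_votes(G_CB, labels_B, k):
    """Assign each node c in C to the community holding most of its neighbors in B."""
    onehot_B = np.eye(k)[labels_B]        # |B| x k indicator of estimated communities
    counts = G_CB @ onehot_B              # counts[c, i] = neighbors of c in B within community i
    return counts.argmax(axis=1), counts

# Hypothetical example: 2 nodes in C, 5 nodes in B, k = 2 communities.
labels_B = np.array([0, 0, 1, 1, 1])      # estimated communities of the nodes in B
G_CB = np.array([[1, 1, 1, 0, 0],         # node c0: two neighbors in community 0, one in 1
                 [0, 0, 1, 1, 0]])        # node c1: two neighbors in community 1
labels_C, counts = assign_by_neighbor_votes(G_CB, labels_B, k=2)
```

Under homophily a node has more neighbors inside its own community on average, so the argmax of these counts aggregates evidence from many edges instead of relying on a single estimated membership vector.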



Procedure 3 $\{S\} \leftarrow$ SupportRecoveryHomophilicModels($G$, $k$, $\alpha_0$, $\xi$, $\hat{\Pi}$)

Input: Adjacency matrix $G \in \mathbb{R}^{n \times n}$, $k$ is the number of communities, $\alpha_0 := \sum_i \alpha_i$, where $\alpha$ is the
Dirichlet parameter vector, $\xi$ is the threshold for support recovery, corresponding to significant community
memberships of an individual. Get estimate $\hat{\Pi}$ from Algorithm 1. Also specify if the model is homophilic:
whether $P(i,i) > P(i,j)$, for all $i \neq j$.
Output: $S \in \{0, 1\}^{n \times k}$ is the matrix supported on large entries of $\Pi$, i.e. declare $S(i,j) = 1$ if the (revised)
estimate of $\Pi(i,j) > \xi$, and 0 otherwise.
if Model satisfies homophily then 

{Now provide improved estimates for support recovery in homophilic models} 
Consider partitions A, B, C, X, Y as in Algorithm 1. 

Define := (ao + l)7j5^j — I" n^sT"-^' similarly define Q'^ {Define Q using H from Algorithm 1.} 

Estimate $F_C \leftarrow G_{C,B} Q_B$, $P \leftarrow Q_C F_C$. {Define a new estimate for P.}
if $\alpha_0 = 0$ (stochastic block model) then
for $c \in C$ do

Let $i^* \leftarrow \arg\max_{i \in [k]} F_C(c, i)$ and $\Pi_c \leftarrow e_{i^*}$. {Assign community with maximum average degree.}
end for
else

Let H be the average of diagonals of P, L be the average of off-diagonals of P
for $c \in C$, $i \in [k]$ do

S{i, c) ^ 1 if Fc{c, i)>L + {H-L)-§ and zero otherwise. {Identify large entries} 
end for 
end if 

Permute the roles of the matrices to get results for A, B, X, Y. 
else 

$S(i,j) \leftarrow \mathbb{I}[\hat{\Pi}(i,j) > \xi]$. {For general models, use the original $\hat{\Pi}$ estimate for support recovery.}
end if 
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Computational complexity: Wc note that the computational complexity of the method is 0{n^k + 
fc'*-^'^a~jjj) when ccq > 1 and O^ti^k) when ao < 1- This is because the time for computing whitening 
matrices is dominated by SVD of the top k singular vectors of n x n matrix, which takes 0{n'^k) time. We 
then compute the whitened tensor T which requires time 0{n^k + k^n) = 0{n?k), since for each i G 1", we 
multiply Gi^A,Gi,B,Gi,c with the corresponding whitening matrices, and this step takes 0{nk) time. We 
then average this k x k x k tensor over different nodes i gY to the result, which takes O(fc^) time in each 
step. 

For the tensor power method, the time required for a single iteration is $O(k^3)$. We need at most $\log n$
iterations per initial vector, and we need to consider $O(\hat{\alpha}_{\min}^{-1} k^{0.43})$ initial vectors (this could be smaller when
$\alpha_0 < 1$). Hence the total running time of the tensor power method is $O(k^{4.43}\hat{\alpha}_{\min}^{-1})$ (and when $\alpha_0$ is small this
can be improved to $O(k^4 \hat{\alpha}_{\min}^{-1})$), which is dominated by $O(n^2 k)$.
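
To make the $O(k^3)$ per-iteration cost concrete, the following is a minimal sketch of the basic tensor power update $u \mapsto T(I,u,u)/\|T(I,u,u)\|$ on a $k \times k \times k$ tensor. It is only the plain iteration: the adaptive deflation and the choice of initial vectors used in Procedure 2 are omitted, and the function names are hypothetical.

```python
import numpy as np

def tensor_apply(T, u):
    """Return the vector T(I, u, u) for a k x k x k tensor T; this costs O(k^3)."""
    return np.einsum('abc,b,c->a', T, u, u)

def power_iterations(T, u0, num_iters):
    """Run plain tensor power iterations from u0 and return (vector, eigenvalue)."""
    u = u0 / np.linalg.norm(u0)
    for _ in range(num_iters):
        v = tensor_apply(T, u)
        u = v / np.linalg.norm(v)
    lam = float(np.einsum('abc,a,b,c->', T, u, u, u))  # eigenvalue estimate T(u, u, u)
    return u, lam
```

On an exactly orthogonal tensor the iteration converges quadratically, which is why a number of iterations logarithmic in the problem size is enough per initial vector.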

In the process of estimating $\Pi$ and $P$, the dominant operation is multiplying a $k \times n$ matrix by an $n \times n$ matrix,
which takes $O(n^2 k)$ time. For support recovery, the dominant operation is computing the "average degree",
which again takes $O(n^2 k)$ time. Thus, we have that the overall computational time is $O(n^2 k + k^{4.43}\hat{\alpha}_{\min}^{-1})$
when $\alpha_0 > 1$ and $O(n^2 k)$ when $\alpha_0 < 1$.



4 Sample Analysis for Proposed Learning Algorithm 

4.1 Sufficient Conditions and Recovery Guarantees 

We now provide recovery guarantees for the proposed learning algorithm under empirical moments, under
a sufficient set of conditions involving the scaling of various parameters such as the network size $n$, the number of
communities $k$, the concentration parameter $\alpha_0$ of the Dirichlet distribution (which is a measure of the overlap of
the communities), and so on.



[A1] Sparse regime of Dirichlet parameters: The community membership vectors are drawn from
the Dirichlet distribution, $\mathrm{Dir}(\alpha)$, under the mixed membership model. We assume that
$\alpha_i < 1$ for all $i \in [k]$ (see Section 2.1 for an extended discussion on the sparse regime of the Dirichlet
distribution).
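
As a quick numerical illustration of the sparse regime, the sketch below draws membership vectors from a symmetric Dirichlet distribution and reports the average size of the largest membership. The helper name `mean_largest_membership` and the parameter values are assumptions chosen for illustration. The rows of the returned array are nodes, which is the transpose of the $k \times n$ convention used for $\Pi$ in the text.

```python
import numpy as np

def mean_largest_membership(alpha, num_nodes, seed=0):
    """Draw num_nodes membership vectors from Dir(alpha).

    Returns the (num_nodes, k) array of draws and the average, over nodes,
    of the largest membership weight.
    """
    rng = np.random.default_rng(seed)
    Pi = rng.dirichlet(alpha, size=num_nodes)   # each row sums to one
    return Pi, float(Pi.max(axis=1).mean())

# Sparse regime (all alpha_i < 1): most of the mass sits on one community.
_, sparse_max = mean_largest_membership([0.05] * 5, 2000)
# Large alpha_i: memberships are spread out across communities.
_, dense_max = mean_largest_membership([5.0] * 5, 2000)
```

In the first case the largest membership is close to one on average, in the second it is much smaller, which matches the reading of $\alpha_0$ as a measure of community overlap.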



[A2] Condition on the network size: Given the concentration parameter of the Dirichlet distribution,
$\alpha_0 := \sum_i \alpha_i$, and $\hat{\alpha}_{\min} := \alpha_{\min}/\alpha_0$, the expected size of the smallest community, define

$$\rho := \frac{\alpha_0 + 1}{\hat{\alpha}_{\min}}. \qquad (22)$$

Given $\delta \in (0, 1)$, we require that the network size scale as

$$n = \Omega\left(\rho^2 \log^2 \frac{k}{\delta}\right), \qquad (23)$$

and that the partitions $A, B, C, X, Y$ are $\Theta(n)$. Note that from assumption A1, $\alpha_i < 1$, which implies that
$\alpha_0 < k$. Thus, in the worst case, when $\alpha_0 = \Theta(k)$, we require $n = \tilde{\Omega}(k^4)$, assuming equal sizes $\hat{\alpha}_i = 1/k$,
and in the best case, when $\alpha_0 = \Theta(1)$, we require $n = \tilde{\Omega}(k^2)$. The latter case includes the stochastic block
model ($\alpha_0 = 0$), and thus our results match the state-of-art bounds for learning stochastic block models.

See Section 4.2 for an extended discussion.
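
The quantity $\rho = (\alpha_0 + 1)/\hat{\alpha}_{\min}$ is simple to evaluate for a given Dirichlet parameter vector. The sketch below is only a worked example: the function names are hypothetical, and `size_scale` returns $\rho^2$ as a proxy for the required order of $n$, ignoring the constants and logarithmic factors that the assumption leaves unspecified.

```python
import numpy as np

def rho(alpha):
    """Return (alpha_0 + 1) / alpha_hat_min for a Dirichlet parameter vector alpha."""
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    if alpha0 <= 0:
        raise ValueError("alpha must have a positive sum")
    alpha_hat_min = alpha.min() / alpha0
    return (alpha0 + 1.0) / alpha_hat_min

def size_scale(alpha):
    """Order of the required network size, rho^2, up to constants and log factors."""
    return rho(alpha) ** 2

# Four equal communities with alpha_i = 0.5: alpha_0 = 2 and rho = (alpha_0 + 1) * k.
example_rho = rho([0.5, 0.5, 0.5, 0.5])
```

For equal community sizes this reduces to $\rho = (\alpha_0 + 1)k$, so $\rho^2$ ranges between order $k^2$ and order $k^4$ as $\alpha_0$ grows from a constant to order $k$.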



Footnote to [A1]: The assumption A1 that the Dirichlet distribution be in the sparse regime is not strictly needed. Our results can be
extended to general Dirichlet distributions, but with worse scaling requirements on $n$. The dependence of $n$ is still polynomial
in $\alpha_0$, i.e. we require $n = \tilde{\Omega}((\alpha_0 + 1)^c \hat{\alpha}_{\min}^{-2})$, where $c \geq 2$ is some constant.

Footnote to [A2]: The notation $\tilde{\Omega}(\cdot), \tilde{O}(\cdot)$ denotes $\Omega(\cdot), O(\cdot)$ up to log factors.
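[A3] Condition on relative community sizes and block connectivity matrix: Recall that $P \in [0,1]^{k \times k}$ denotes the block connectivity matrix. Define

$$\zeta := \left(\frac{\hat{\alpha}_{\max}}{\hat{\alpha}_{\min}}\right)^{1/2} \frac{\sqrt{\max_i (P\hat{\alpha})_i}}{\sigma_{\min}(P)}, \qquad (24)$$

where $\sigma_{\min}(P)$ is the minimum singular value of $P$. We require that

$$\zeta = \begin{cases} O\left(\dfrac{n^{1/2}}{\rho}\right), & \alpha_0 < 1, \qquad (25) \\[2ex] O\left(\dfrac{n^{1/2}}{\rho\, k\, \hat{\alpha}_{\max}}\right), & \alpha_0 \geq 1. \qquad (26) \end{cases}$$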



Intuitively, the above condition requires the ratio of maximum and minimum expected community sizes to 
be not too large and for the matrix P to be well conditioned. The above condition is required to control 
the perturbation in the whitened tensor (computed using observed network samples), thereby, providing 
guarantees on the estimated eigen-pairs through the tensor power method. The above condition can be 
interpreted as a separation requirement between intra-community and inter-community connectivity in the 
special case considered in Section 4.2. 



[A4] Condition on number of iterations of the power method: We assume that the number of
iterations $N$ of the tensor power method in Procedure 2 satisfies

$$N \geq C_2 \cdot \left(\log(k) + \log\log\left(\frac{\sigma_{\min}(P)}{\max_i (P\hat{\alpha})_i}\right)\right), \qquad (27)$$

for some constant $C_2$.

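[A5] Choice of $\tau$ for thresholding community vector estimates: The threshold $\tau$ for obtaining
estimates $\hat{\Pi}$ of community membership vectors in Algorithm 1 is chosen as

$$\tau = \begin{cases} \Theta\left(\dfrac{\rho^{1/2} \cdot \zeta \cdot \hat{\alpha}_{\max}^{1/2}}{n^{1/2} \cdot \hat{\alpha}_{\min}}\right), & \alpha_0 \neq 0, \qquad (28) \\[2ex] 0.5, & \alpha_0 = 0. \qquad (29) \end{cases}$$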



For the stochastic block model ($\alpha_0 = 0$), since $\pi_i$ is a basis vector, we can use a large threshold. For general
models ($\alpha_0 \neq 0$), $\tau$ can be viewed as a regularization parameter and decays as $n^{-1/2}$ when other parameters
are held fixed. Moreover, when $n = \tilde{\Theta}(\rho^2)$, we have that $\tau \sim \rho^{-1/2}$ when other terms are held fixed. Recall
that $\rho \propto (\alpha_0 + 1)$ when the expected community sizes $\hat{\alpha}_i$ are held fixed. In this case, $\tau \sim \rho^{-1/2}$ allows for
smaller values to be picked up after thresholding as $\alpha_0$ is increased. This is intuitive since, as $\alpha_0$ increases,
the community vectors $\pi$ are more "spread out" across different communities and have smaller values.
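
The thresholding operation itself is elementary. A minimal sketch follows, assuming that thresholding keeps entries that are at least $\tau$ and sets all other entries to zero; the function name `thres` is chosen here for illustration.

```python
import numpy as np

def thres(M, tau):
    """Set entries of M below tau to zero and keep the remaining entries unchanged."""
    M = np.asarray(M, dtype=float)
    return np.where(M >= tau, M, 0.0)

# Small and negative entries (noise in the raw estimate) are removed,
# entries at or above the threshold survive.
raw = np.array([[0.60, 0.05, -0.10],
                [0.20, 0.50, 0.30]])
clean = thres(raw, 0.25)
```

A smaller $\tau$ keeps more of the weak memberships, which is the behaviour described above for larger $\alpha_0$.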

We are now ready to state the error bounds on estimating $\Pi$ and $P$ using Algorithm 1. The proofs are
given in the Appendix and a proof outline is provided in Section 4.3. Recall that for a matrix $M$, $(M)^i$ and
$(M)_i$ denote the $i$-th row and column respectively.



Theorem 4.1 (Guarantees on estimating $P$, $\Pi$). Under assumptions A1-A5, the estimates $\hat{P}$ and $\hat{\Pi}$ obtained
from Algorithm 1 satisfy, with high probability,



e^,i, := maxKflx)^ - {XlzfU = 6 (n^/^ ■ p3/2 . ^ . 

iG [k] \ / 

ep := max \{QGQ'')ij - Pi,j\ = (n-^^ . p5/2 . ^ . g3/2 ^ 



(30) 
(31) 
Thus the above result provides recovery guarantees on the estimates of $\Pi$ and $P$. The main ingredients in
establishing the above result are the tensor concentration bound and, additionally, recovery guarantees under
the tensor power method in Procedure 2. We now provide these results below.

Recall that $F_A := \Pi_A^\top P^\top$ and that $\Phi = W_A^\top F_A \mathrm{Diag}(\hat{\alpha}^{1/2})$ denotes the set of tensor eigenvectors under
exact moments in (20), and $\hat{\Phi}$ is the set of estimated eigenvectors under empirical moments, obtained using
Procedure 1. We establish the following guarantees.

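Lemma 4.2 (Perturbation bound for estimated eigen-pairs). Under the assumptions A1-A4, the recovered
eigenvector-eigenvalue pairs $(\hat{\Phi}_i, \hat{\lambda}_i)$ from the tensor power method in Procedure 2 satisfy, with probability
$1 - 109\delta$, for some permutation $\theta$,

$$\max_{i \in [k]} \|\hat{\Phi}_i - \Phi_{\theta(i)}\| \leq 8\hat{\alpha}_{\max}^{1/2}\varepsilon_T, \qquad \max_{i} |\hat{\lambda}_i - \hat{\alpha}_{\theta(i)}^{-1/2}| \leq 5\varepsilon_T. \qquad (32)$$

The tensor perturbation bound $\varepsilon_T$ is given by

$$\varepsilon_T := \left\| \hat{T}^{\alpha_0}_{Y \to \{A,B,C\}}(\hat{W}_A, \hat{W}_B\hat{R}_{AB}, \hat{W}_C\hat{R}_{AC}) - \mathbb{E}\left[T^{\alpha_0}_{Y \to \{A,B,C\}}(W_A, \tilde{W}_B, \tilde{W}_C) \mid \Pi_{A \cup B \cup C}\right] \right\| = \tilde{O}\left(\frac{\rho}{\sqrt{n}} \cdot \frac{\zeta}{\hat{\alpha}_{\max}^{1/2}}\right), \qquad (33)$$

where $\|T\|$ for a tensor $T$ refers to its spectral norm, $\rho$ is defined in (22) and $\zeta$ in (24).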
4.2 Special case: uniform community sizes and structured P 

It is easier to interpret the results from the earlier section for the special case where all the communities
have the same expected size and the entries of the community connectivity matrix $P$ are equal on diagonal
and off-diagonal locations:

$$\hat{\alpha}_i \equiv \frac{1}{k}, \qquad P(i,j) = p \cdot \mathbb{I}(i = j) + q \cdot \mathbb{I}(i \neq j), \qquad p > q. \qquad (34)$$

In other words, the probability of an edge according to $P$ only depends on whether it is between two
individuals of the same community or between different communities. The above setting is well studied for
stochastic block models ($\alpha_0 = 0$) and we compare our results with existing results for this setting.
In this setting, we have

$$\sigma_{\min}(P) = \Theta(p - q), \qquad \max_i (P\hat{\alpha})_i = \frac{p}{k} + (k-1)\frac{q}{k} \leq p.$$

Thus, the assumptions A2 and A3 in Section 4.1 reduce to

$$n = \tilde{\Omega}\left((\alpha_0 + 1)^2 k^2\right), \qquad \zeta = \Theta\left(\frac{\sqrt{p}}{p - q}\right) = O\left(\frac{n^{1/2}}{(\alpha_0 + 1)k}\right). \qquad (35)$$

Thus, we obtain transparent conditions on the scaling of $n$ and on the separation $p - q$ between intra-community
and inter-community connectivity. We now provide recovery guarantees for this setting.
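
The closed-form expressions for this special case can be checked numerically. The sketch below assumes $\zeta = (\hat{\alpha}_{\max}/\hat{\alpha}_{\min})^{1/2}\sqrt{\max_i (P\hat{\alpha})_i}/\sigma_{\min}(P)$, as in assumption A3. The helper names `structured_P` and `zeta` are hypothetical.

```python
import numpy as np

def structured_P(k, p, q):
    """Connectivity matrix with p on the diagonal and q off the diagonal."""
    return q * np.ones((k, k)) + (p - q) * np.eye(k)

def zeta(P, alpha_hat):
    """Evaluate the conditioning parameter zeta for a given P and normalized alpha."""
    P = np.asarray(P, dtype=float)
    a = np.asarray(alpha_hat, dtype=float)
    sigma_min = np.linalg.svd(P, compute_uv=False).min()
    return np.sqrt(a.max() / a.min()) * np.sqrt((P @ a).max()) / sigma_min

k, p, q = 4, 0.5, 0.1
P = structured_P(k, p, q)
alpha_hat = np.full(k, 1.0 / k)
sigma_min = np.linalg.svd(P, compute_uv=False).min()   # equals p - q when p > q >= 0
value = zeta(P, alpha_hat)                             # sqrt(p/k + (k-1)q/k) / (p - q)
```

Since $P = (p-q)I + q\,\vec{1}\vec{1}^\top$ has eigenvalues $p - q$ (with multiplicity $k-1$) and $p + (k-1)q$, the smallest singular value is exactly $p - q$ here, and $\zeta$ is at most $\sqrt{p}/(p-q)$.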

Corollary 4.3 (Uniform community sizes and structured $P$). Under assumptions A1-A5 in Section 4.1 and
(34), we have with high probability



e^,i, :=max||ff-ff||i = 



'(ao + 1)'/' 



up 



(p-q) 

^3/2, 



sp := max \{QGQ )^j - Pij \ = O — . ^ 

^ \ {p-q)Vn ^ 

where $\varepsilon_T$ is the (whitened) tensor perturbation bound defined in (33).
Note that the assumption $p > q$ is not required for the above results in Corollary 4.3 to hold and we
can replace $p$ by $\max(p, q)$ and $p - q$ with $|p - q|$ in the above bounds. However, note that we require
the assumption that $p > q$ in Section 4.2.1 to provide improved guarantees for support recovery using
Procedure 3.



Stochastic block models ($\alpha_0 = 0$): For stochastic block models, (35) reduces to

$$n = \tilde{\Omega}(k^2), \qquad \zeta = \Theta\left(\frac{\sqrt{p}}{p - q}\right) = O\left(\frac{n^{1/2}}{k}\right). \qquad (36)$$

This matches the best known scaling (up to poly-log factors), which was previously achieved via convex
optimization in [44] for stochastic block models. However, our results in Corollary 4.3 do not provide zero-error
guarantees as in [44], since they assume a general mixed membership model. We strengthen our results to
provide zero-error guarantees in Section 4.2.1 and thus match the scaling of [44] for stochastic block models.
Moreover, we also provide zero-error support recovery guarantees for recovering significant memberships of
nodes in mixed membership models in Section 4.2.1.



Dependence on $\alpha_0$: The guarantees degrade as $\alpha_0$ increases, which is intuitive since the extent of
community overlap increases. The requirement for scaling of $n$ also grows as $\alpha_0$ increases. Note that the
guarantees on $\varepsilon_\pi$ and $\varepsilon_P$ can be improved by assuming a more stringent scaling of $n$ with respect to $\alpha_0$,
rather than the one specified by (35).



4.2.1 Zero-error guarantees for support recovery 

Recall that we proposed Procedure 3 to provide improved support recovery estimates for homophilic models 
(where there are more expected edges within each community than to any community outside). We now 
provide guarantees for this method. We limit our analysis to the setting in (34) with uniform communities 
and structured matrix P. In principle, the analysis can be extended to more general homophilic models with 
suitable modifications to the method in Procedure 3. 

We now specify the threshold $\xi$ for support recovery in Procedure 3.



[A6] Choice of $\xi$ for support recovery: We assume that the threshold $\xi$ in Procedure 3 satisfies

$$\xi = \Omega(\varepsilon_P),$$

where $\varepsilon_P$ is specified in Corollary 4.3. We now state the guarantees for support recovery.

Theorem 4.4 (Support recovery guarantees). Assuming A1-A6 and (34) hold, the support recovery method
in Procedure 3 has the following guarantees on the estimated support set $\hat{S}$: with high probability,

$$\Pi(i,j) \geq \xi \Rightarrow \hat{S}(i,j) = 1 \quad \text{and} \quad \Pi(i,j) \leq \frac{\xi}{2} \Rightarrow \hat{S}(i,j) = 0, \qquad \forall i \in [k],\ j \in [n], \qquad (37)$$

where $\Pi$ is the true community membership matrix.

Thus, the above result guarantees that Procedure 3 correctly recovers all the "large" entries of $\Pi$ and
also correctly rules out all the "small" entries in $\Pi$. In other words, we can correctly infer all the significant
memberships of each node and also rule out the set of communities where a node does not have a strong
presence.

The only shortcoming of the above result is that there is a gap between the "large" and "small" values, and
for an intermediate set of values (in $[\xi/2, \xi]$), we cannot guarantee correct inferences about the community
memberships. Note this gap depends on $\varepsilon_P$, the error in estimating the $P$ matrix. This is intuitive, since as
the error $\varepsilon_P$ decreases, we can infer the community memberships over a larger range of values.

For the special case of stochastic block models (i.e. $\alpha_0 \to 0$), we can improve the above result and
give a zero-error guarantee at all nodes (w.h.p.). Note that we no longer require a threshold $\xi$ in this case,
and only infer one community for each node.
Corollary 4.5 (Zero error guarantee for block models). Assuming A1-A5 and (34) hold, the support recovery
method in Procedure 3 correctly identifies the community memberships for all nodes with high probability in
the case of stochastic block models ($\alpha_0 \to 0$).

Thus, with the above result, we match the state-of-art results of [44] for stochastic block models in terms
of scaling requirements and recovery guarantees.

4.3 Proof Outline 

We now summarize the main techniques involved in proving Theorem 4.1. The details are in the Appendix. 
The main ingredient is the concentration of the adjacency matrix: since the edges are drawn independently 
conditioned on the community memberships, we establish that the adjacency matrix concentrates around its 
mean under the stated assumptions. See Appendix C.4 for details. With this in hand, we can then establish 
concentration of various quantities used by our learning algorithm. 

Step 1: Whitening matrices. We first establish concentration bounds on the whitening matrices $\hat{W}_A$,
$\hat{W}_B$, $\hat{W}_C$ computed using empirical moments, described in Section 3.3.1. With this in hand, we can approximately
recover the span of the matrix $F_A$ since $W_A^\top F_A \mathrm{Diag}(\hat{\alpha})^{1/2}$ is a rotation matrix. The main technique
employed is the matrix Bernstein inequality [40, thm. 1.4]. See Appendix C.2 for details.
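
The role of whitening can be illustrated on a simplified symmetric example. The sketch below is not the paper's construction, which whitens moments built from the adjacency submatrices. It assumes a symmetric positive semidefinite matrix $M = FF^\top$ of rank $k$ and shows that $W^\top M W = I$ forces $W^\top F$ to be a rotation. The function name `whitening_matrix` is hypothetical.

```python
import numpy as np

def whitening_matrix(M, k):
    """Return W (n x k) such that W.T @ M @ W = I_k.

    M is assumed symmetric positive semidefinite with at least k
    strictly positive eigenvalues.
    """
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]          # indices of the k largest eigenvalues
    return evecs[:, top] / np.sqrt(evals[top])

rng = np.random.default_rng(0)
F = rng.random((30, 3))            # stand-in for an n x k factor matrix
M = F @ F.T
W = whitening_matrix(M, 3)
R = W.T @ F                        # k x k matrix satisfying R @ R.T = I
```

Once the factors are rotated into an orthonormal system in this way, the third-order moment becomes an orthogonally decomposable tensor, which is what the power method requires.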

Step 2: Tensor concentration bounds. Recall that we use the whitening matrices to obtain a symmetric
orthogonal tensor. We establish that the whitened and symmetrized tensor concentrates around its mean.
This is done in several stages and we carefully control the tensor perturbation bounds. See Appendix C.1
for details.

Step 3: Tensor power method analysis. We analyze the performance of Procedure 2 under empirical
moments. We employ the various improvements detailed in Section 3.3.2 to establish guarantees on the
recovered eigen-pairs. This includes coming up with a condition on the tensor perturbation bound for the
tensor power method to succeed. It also involves establishing that there exist good initializers for the power
method among (whitened) neighborhood vectors. This allows us to obtain stronger guarantees for the tensor
power method, compared to the earlier analysis in [4]. This analysis is crucial for us to obtain state-of-art scaling
bounds for guaranteed recovery (for the special case of the stochastic block model). See Appendix B for details.

Step 4: Thresholding of estimated community vectors. In Step 3, we provide guarantees for recovery
of each eigenvector in $\ell_2$ norm. Direct application of this result only allows us to obtain $\ell_2$ norm bounds for
row-wise recovery of the community matrix $\Pi$. In order to strengthen the result to an $\ell_1$ norm bound, we
threshold the estimated $\Pi$ vectors. Here, we exploit the sparsity in Dirichlet draws and carefully control the
contribution of weak entries in the vector. Finally, we establish perturbation bounds on $P$ through rather
straightforward concentration bound arguments. See Appendix A.2 for details.

Step 5: Support recovery guarantees. It is convenient to consider the case of the stochastic block model
here in the canonical setting of Section 4.2. Recall that Procedure 3 readjusts the community membership
estimates based on degree averaging. For each vertex, if we count the average degree towards these "approximate
communities", for the correct community the result is concentrated around the value $p$ and for the wrong
community the result is around the value $q$. Therefore, we can correctly identify the community memberships
of all the nodes when $p - q$ is sufficiently large, as specified by (35). The argument can be easily extended
to general mixed membership models. See Appendix A.3 for details.
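
The degree-averaging idea of Step 5 can be sketched for the stochastic block model as follows. This is a simplified illustration: it reuses the whole adjacency matrix rather than the separate partitions that the analysis relies on for independence, it assumes every approximate community is non-empty, and the function name `refine_by_average_degree` is hypothetical.

```python
import numpy as np

def refine_by_average_degree(G, approx_labels, k):
    """Reassign each node to the approximate community with the largest average degree.

    G             : (n, n) symmetric 0/1 adjacency matrix.
    approx_labels : length-n integer array of approximate community labels in [0, k).
    """
    onehot = np.eye(k)[approx_labels]      # (n, k) indicators of the approximate communities
    sizes = onehot.sum(axis=0)             # community sizes, assumed non-zero
    avg_degree = (G @ onehot) / sizes      # average number of edges to each community
    return np.argmax(avg_degree, axis=1)
```

For a node in a given community, the average degree towards its own approximate community is close to $p$ and towards any other close to $q$, so a moderate fraction of initial mistakes is corrected as long as $p - q$ is large enough.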
A Proof of Theorem 4.1

A.1 Proof of Lemma 4.2

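We have the tensor perturbation bound in Theorem C.1 as follows: given $\delta \in (0, 1)$, $\rho := \frac{\alpha_0 + 1}{\hat{\alpha}_{\min}}$,

$$n = \tilde{\Omega}\left(\rho^2 \log^2 \frac{k}{\delta}\right), \qquad (38)$$

and

$$\zeta := \left(\frac{\hat{\alpha}_{\max}}{\hat{\alpha}_{\min}}\right)^{1/2} \frac{\sqrt{\max_i (P\hat{\alpha})_i}}{\sigma_{\min}(P)},$$

the following tensor perturbation bound holds with probability $1 - 100\delta$:

$$\varepsilon_T := \left\| \hat{T}^{\alpha_0}_{Y \to \{A,B,C\}}(\hat{W}_A, \hat{W}_B\hat{R}_{AB}, \hat{W}_C\hat{R}_{AC}) - \mathbb{E}\left[T^{\alpha_0}_{Y \to \{A,B,C\}}(W_A, \tilde{W}_B, \tilde{W}_C) \mid \Pi_{A \cup B \cup C}\right] \right\| = \tilde{O}\left(\frac{\rho}{\sqrt{n}} \cdot \frac{\zeta}{\hat{\alpha}_{\max}^{1/2}}\right). \qquad (39)$$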

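From Theorem B.1, we require that the perturbation of the tensor be small enough according to

$$\varepsilon_T \leq C_1 \hat{\alpha}_{\max}^{-1/2} r_0^2 \qquad (40)$$

for some constant $C_1$, in order to guarantee recovery of the eigen-pairs under the tensor power iteration
method, when initialized with a $(\gamma, r_0)$-good vector.

By Lemma C.10, when $\zeta = O(\sqrt{n}\, r_0^2/\rho)$, we have good initial vectors. The requirement that $\varepsilon_T \leq
C_1 \hat{\alpha}_{\max}^{-1/2} r_0^2$ turns out to be equivalent to $\zeta = O(\sqrt{n}\, r_0^2/\rho)$.

Therefore when $\zeta = O(\sqrt{n}\, r_0^2/\rho)$, the assumptions of Theorem B.1 are satisfied. Recall that $r_0^2 =
\Omega(1/(\hat{\alpha}_{\max} k))$ when $\alpha_0 > 1$ and $r_0^2 = \Omega(1)$ for $\alpha_0 \leq 1$.

Additionally from Lemma C.10, in order to obtain good initialization vectors with probability $1 - 9\delta$
under the Dirichlet distribution, we require that

$$n = \tilde{\Omega}\left(\hat{\alpha}_{\min}^{-1} k^{0.43} \log(k/\delta)\right), \qquad (41)$$

when $\alpha_0 > 1$, which is always satisfied since we are assuming $\hat{\alpha}_{\min}^{-2} < n$.

From Theorem B.1, we see that the tensor power method returns eigenvalue-vector pairs $(\hat{\lambda}_i, \hat{\Phi}_i)$ such
that there exists a permutation $\theta$ with

$$\max_{i \in [k]} \|\hat{\Phi}_i - \Phi_{\theta(i)}\| \leq 8\hat{\alpha}_{\max}^{1/2}\varepsilon_T, \qquad (42)$$

and

$$\max_{i} |\hat{\lambda}_i - \hat{\alpha}_{\theta(i)}^{-1/2}| \leq 5\varepsilon_T. \qquad (43)$$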



A.2 Reconstruction after tensor power method

Let $(M)^i$ and $(M)_i$ denote the $i$-th row and $i$-th column in matrix $M$ respectively. Let $Z \subseteq A^c$ denote any
subset of nodes not in $A$, considered in Procedure LearnPartitionCommunity. Define

$$\tilde{\Pi}_Z := \mathrm{Diag}(\hat{\lambda})^{-1} \hat{\Phi}^\top \hat{W}_A^\top G_{Z,A}^\top.$$

Recall that the final estimate $\hat{\Pi}_Z$ is obtained by thresholding $\tilde{\Pi}_Z$ element-wise with threshold $\tau$ in Procedure 1.
We first analyze the perturbation of $\tilde{\Pi}_Z$.
Lemma A.1 (Reconstruction Guarantees for $\tilde{\Pi}_Z$). Assuming Lemma 4.2 holds and the tensor power method
recovers eigenvectors and eigenvalues up to the guaranteed errors, we have with probability $1 - 122\delta$,

$$\varepsilon_\pi := \max_{i} \|(\tilde{\Pi}_Z)^i - (\Pi_Z)^i\| = O\left(\varepsilon_T\, \hat{\alpha}_{\max}^{1/2} \left(\frac{\hat{\alpha}_{\max}}{\hat{\alpha}_{\min}}\right)^{1/2} \|\Pi_Z\|\right),$$

where $\varepsilon_T$ is given by (39).

Proof: We have {flzY = K^{{^)iVWjGj y^. We will now use perturbation bounds for each of the terms 

to get the result. 
The first term is 

II Diag(A.)-i - Diag(Sl/^)|| • || Dmgia'/^)Fj\\ ■ \\Fa\\ ■ \\Uz\\ 
< 5£TamaxS,;[f (l + eO'lln^ll 
from the fact that || Diag(S^/^)F4 || < 1 + ei, where ei is given by (68). The second term is 

II Diag(SV2)|| . ||($), _ Sy^(f'^),|| . IIF^II . Iln^ll 



<8a^ax£TaJf (l + £i)||nz| 



The third term is 



a. 



1/2, 



■\\(Wj-Wj)FAnz\ 



<^UL^Ji'\\^z\\ew (44) 



1/2 



<0{{^] srarnllnzll 1 , (45) 



from Lemma C.2 and finally, we have 

\\al^^\\-\\WA\\-\\Gl,-FAnz\ 



< O (aUL^^^^^^^^J (niax(Pa)O(l + £2 + £3) log J ] (46) 

from Lemma C.7 and Lemma C.8. 

The third term in (45) dominates the last term in (47) since $(\alpha_0 + 1)\log(k/\delta) < n\hat{\alpha}_{\min}$ (due to assumption
A2 on the scaling of $n$). □

We now show that if we threshold the entries of $\tilde{\Pi}_Z$, the resulting matrix $\hat{\Pi}_Z$ has rows close to those
in $\Pi_Z$ in $\ell_1$ norm.

Lemma A.2 (Guarantees after thresholding). For $\hat{\Pi}_Z := \mathrm{Thres}(\tilde{\Pi}_Z, \tau)$, where $\tau$ is the threshold, we have
with probability $1 - 2\delta$, that



max |(nz)^-(n2)*|i = 1 V^e^Wlog^fl- ' 21og(fc/<5) 



Teffc] ' ^ I ^ V ^ 2r \ V "-'7log(l/2r) 



+nriT + \j {nr] + At'^) log ^ + ^ 



where $\eta = \hat{\alpha}_{\max}$ when $\alpha_0 < 1$ and $\eta = \alpha_{\max}$ when $\alpha_0 \in [1, k)$.
Remark 1: The above guarantee on $\hat{\Pi}_Z$ is stronger than for $\tilde{\Pi}_Z$ in Lemma A.1 since this is an $\ell_1$ guarantee
on the rows compared to an $\ell_2$ guarantee on the rows for $\tilde{\Pi}_Z$.

Remark 2: When r is chosen as 

N r\ \ r S "max \ 



r 



e(^) = e 



we have that 

max|(flz)* - (nz)1i = O (v^- 
ie[fe] 



Proof: Let Si := {j : Ilzii,j) > 2t}. For a vector v, let vs denote the sub-vector by considering entries in 
set S. We now have 

mzY - (UzYli < litizYs, - (nz)kli + mzYsfli + mzYs^U 

Case ao < 1: From Lemma C.ll, we have P[n(i, 7) > 2t] < SSi log(l/2T). Since are independent 

for j G Z, we have from multiphcative Chernoff bound [26, Thm 9.2], that with probabihty 1 — 6, 



max|Sj| < SnSmaxlog 



1 \ / / 2\og{k/S) 



ie[k] J \ y «Sjlog(l/2T)y 

We have 

mzYs.-i^zYsM<e.\S.\'^^ 

and the i"* rows of Ilz and liz can differ on Si, we have \Ilz{i,j) — tlz{i,j)\ < t, ioi j & Si, and number of 
such terms is at most £^/t^. Thus, 

mzYs.-{^zYsM<'f- 

For the other term, from Lemma C.ll, we have 

E[Iiz{i,j) ■ d(nz{i,j) < 2t)] < ai{2T). 
Applying Bernstein's bound we have with probability 1 — ^ 



max Vnz(i,j) • 5{nz{i,j) < 2r) < namax(2r) + W2(namax + 4r2) log ^. 
ielk] f^^ V 6 

For fl^c, we further divide 5f into Tj and Ui, where Ti := {j : t/2 < Uz{i,j) < 2t} and Ui := {j : Ilz{i,j) < 
In the set Ti, using similar argument we know 1(112)7^ — (nx)^^!! < 0{s.,r\^ namax log 1 /t) , therefore 



IHtJi < in^.li < |n^. -n^.li + |n^s|li < o(e,/namaxiogi/T). 

Finally, for index j e Ui, in order for Ilz{i,j) be positive, it is required that Ilz{i,j) — Hz{i,j) > t/2. 
In this case, we have 

mzYuM<- {UzYu.-^k '<—■ 
Case $\alpha_0 \in [1, k)$: From Lemma C.11, we see that the results hold when we replace $\hat{\alpha}_{\max}$ with $\alpha_{\max}$. □

Finally we would like to use the community vectors $\Pi$ and the adjacency matrix $G$ to estimate the $P$
matrix.

Lemma A. 3 (Reconstruction of P). Let Q = ""^"^ ^nDiag(A)^ — ^j^^ll^^ and G is the adjacency matrix. 
When with probability 1 — 56, 

ep := max \{QGQ^K, - P,,| < O (^J^^l±^^aJ^aUL^og^) 
i,3e[n] \ y/n J 

Proof: Let Q = sa±l ^nDiag(a)-i - ^^11^). The proof goes in three steps: 

P « QH'^PUQ'^ « QGQ'^ w QGQ'^. 

First we show with high probability QIl^ is a fc x fc matrix that is very close to identity (note that Q is con- 
structed so that En [QH^] = I)- ForalH j^j, {QIl^)ij has absolute value at most 0{y/log{nk/6) ■ OLijajj ^Jn) 
with probability 1 — (5. This is because each entry in can be viewed as the sum of n independent variables 
(^ "'''~i^/"^''' nj,t) with mean 0, and variance bounded by 0{^/a^j/n) (recall E[n?J = on ■ < 
2ai/(ao + 1)). The bound follows from Bernstein's inequality and union bound. For diagonal entries, 
{QTC^)i.i can also be viewed as the sum of n independent variables ( "o+i '^'''"^^'^'''^ Ili.t) with mean 1/n 

and variance a^ ^^^/rt, therefore its difference with 1 is bounded by ^^"^ /^/n with high probability. Us- 
ing these bounds, the i-th row of QII^ is ej + where Aj is an error vector with i\ norm bounded by 
kyJ\og{nk/5)amax/amin/\/n with probability 1-5. 

Now consider the matrix QIl^ PliQ^ . When entries of Qli^ satisfy the bounds we derived above, we 
know the i,i-th entry of QH^PIIQ^ is close to Pij, with probability 1 — ^, 

(Qn^PnQ^)i,,- - Pij = {ei + Ai)^P(e,- + 5j) - Pij 

= AjP{ej+Aj) + eJPAj 

< 2\Ai\i + \Aj\i 

^ Q I • y/amax/amin L nk \ 
~ y \/n V (5 y ' 

where we used the fact that Pj,j e [0, 1]. 

Next we show QYl^ PHQ^ is close to QGQ^ . This holds due to the properties of Dirichlet distribution. 
Note that the £2 norm of each row of Q is bounded by 0{yy\og{nk / 5) / (ncii)) with probability I — S. On 
the off diagonal entries, E[G] = II^PII, and conditioned on 11, each entry of G is an independent Bernoulli 
variable, therefore the total variance is bounded by 0(logn/namin). By Bernstein's inequality with high 
probability the deviation is bounded by O (log {nk/S)/^/ nSmin ) with probability 1 — S. For diagonal entries, 
Gi,i = 0, and (n^Pn)^,, < 1. So the error in diagonal entries is at most the inner product of two rows in Q, 
which is equal to 0(log n/nSmin) (note that the diagonal entries do not dominate the error). 

In the last step, we replace Q by its estimate Q. We shall write as + Aj, where Aj is the £1 norm 
perturbation of row i then 

{QGQ^)ij - {QGQ^)ij = iQ' + A,)G(Q^' + A,)^ - Q'G{Q^)^ 

= A,G{Q^ + Aj)^ + Q'GA] 
<0(|A,|i + |A,|i) 

We now bound |Ai|i. With probability 1 — 25 

|A,|i = \Q' - Q% < (a^[„£.v^+ Iff |i£T3-[f ) 

/ (ao + 1)^ 1 V2 w 
where the first term is due to $\ell_1$ guarantees on the row-wise perturbation between $\hat{\Pi}$ and $\Pi$ in Lemma A.2 and the
second term comes from the eigenvalue perturbation in Lemma 4.2. We use the fact that since $\mathbb{E}[|\Pi^i|_1] = n\hat{\alpha}_i$,
$|\Pi^i|_1 \approx n\hat{\alpha}_i$ with high probability. By expanding the expression we see that the last term dominates all the
other terms. Therefore, the result holds. □
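
The property that $Q\Pi^\top$ is close to the identity can be checked numerically. The sketch below assumes the specific form $Q = \frac{\alpha_0+1}{n}\mathrm{Diag}(\hat{\alpha})^{-1}\Pi - \frac{\alpha_0}{n}\vec{1}\vec{1}^\top$ with $\Pi \in \mathbb{R}^{k \times n}$. This is one matrix satisfying $\mathbb{E}_\Pi[Q\Pi^\top] = I$ under the Dirichlet second moments, and it is used here as an assumption rather than as a transcription of the lemma. The function name `moment_inverse` is hypothetical.

```python
import numpy as np

def moment_inverse(Pi, alpha):
    """Build a k x n matrix Q with E[Q @ Pi.T] = I for Dirichlet(alpha) columns of Pi."""
    alpha = np.asarray(alpha, dtype=float)
    k, n = Pi.shape
    alpha0 = alpha.sum()
    alpha_hat = alpha / alpha0
    # Scale row i of Pi by (alpha0 + 1) / (n * alpha_hat_i), then subtract alpha0 / n everywhere.
    return (alpha0 + 1.0) / n * (Pi / alpha_hat[:, None]) - alpha0 / n

rng = np.random.default_rng(0)
alpha = np.array([0.3, 0.3, 0.3])
Pi = rng.dirichlet(alpha, size=50000).T      # k x n membership matrix
Q = moment_inverse(Pi, alpha)
deviation = np.max(np.abs(Q @ Pi.T - np.eye(3)))   # shrinks roughly like 1/sqrt(n)
```

The identity in expectation follows from $\mathbb{E}[\pi\pi^\top] = \frac{1}{\alpha_0+1}\left(\mathrm{Diag}(\hat{\alpha}) + \alpha_0\hat{\alpha}\hat{\alpha}^\top\right)$ and $\mathbb{E}[\pi] = \hat{\alpha}$.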

A.3 Zero-error support recovery guarantees

Recall that we proposed Procedure 3 to provide improved support recovery estimates in the special case
of homophilic models (where there are more edges within a community than to any community outside).
We limit our analysis to the special case of uniform sized communities ($\hat{\alpha}_i = 1/k$) and matrix $P$ such that
$P(i,j) = p\,\mathbb{I}(i = j) + q\,\mathbb{I}(i \neq j)$ and $p > q$. In principle, the analysis can be extended to homophilic models
with a more general $P$ matrix (with suitably chosen thresholds for support recovery). We first consider the analysis
for the stochastic block model (i.e. $\alpha_0 \to 0$).

Proof of Corollary 4.5: Since the threshold $\xi$ is $1/2$ in the case of stochastic block models, we claim that the $\ell_1$
error for rows of $\hat{\Pi}$ is at most $O(\varepsilon_\pi^2)$, since $\Pi(i,j) \in \{0, 1\}$ and, in order for our method to make a mistake, it
takes $1/4$ in the $\ell_2^2$ error. Thus, we view this as an approximation for community memberships (except for
$O(\varepsilon_\pi^2)$ mistakes).

Notice that the next step (assigning communities for vertices in $A$) uses a completely different set of
edges; therefore, conditioned on $\hat{\Pi}$, this step is independent of all previous steps. For each vertex we
compute the average number of edges from this vertex to all the approximate communities, and set it to
belong to the one with the largest average degree. In order for this process to succeed, the error caused by the
mistakes in the approximate communities (which is bounded by $(p - q)\,O(\varepsilon_\pi^2 \cdot k/|A|)$, because each mistake causes a
difference of $p - q$ in the expected number of edges) must be smaller than the difference in the expected average
number of edges (which is equal to $p - q$). At the same time, we need the average degree to concentrate well
around its expectation, but this follows directly from Bernstein's inequality because the variance is bounded
by $O(pk/n)$. Combining these requirements, we know the algorithm works when $O((p - q)\varepsilon_\pi^2 k/n) < (p - q)$,
which implies

□ 

We now prove the general result.

Proof of Theorem 4.4: We first show that the $\hat{F}_C$ computed by the algorithm is entry-wise $O((p - q)\varepsilon_P)$
close to the true $F_C$ matrix.

The proof is almost identical to the proof of Lemma A.3. The $\hat{Q}_B$ we have here is very similar to the matrix
$\hat{Q}$ defined in Lemma A.3. Apart from having a different support, the only difference is that we are making sure
the $\ell_1$ norm of the first term is $(\alpha_0 + 1)$, while in Lemma A.3 it only has expected $\ell_1$ norm 1. Since the $\ell_1$ norm
concentrates well, this will be a low-order term in the final bound.

Similarly we define Q% ^ {ao + + JsT^ ■ 

As in the proof of Lemma A. 3 we have that IIbQJ is a matrix that is close to /, and 

UnlPiiBQDij - (nlP)ij\ < HhIpyhqb), - e,)| = |(n5P)^A,-| 



n 




The reason this bound is a factor $p$ smaller is that the entries in $(\Pi_C^\top P)^i$ are upper bounded by $p$ instead
of $1$. As before, this is actually a low-order term.

Now we show that if we replace $Q_B$ with $\hat{Q}_B$ here, the difference is also small, using $|\hat{Q}_B^j - Q_B^j|_1 \leq O(\varepsilon_P)$:

$$|(\Pi_C^\top P \Pi_B \hat{Q}_B^\top)_{i,j} - (\Pi_C^\top P \Pi_B Q_B^\top)_{i,j}| = |(\Pi_C^\top P \Pi_B)^i (\hat{Q}_B^\top - Q_B^\top)_j| \leq \left(\max \Pi_C^\top P \Pi_B - \min \Pi_C^\top P \Pi_B\right) |(\hat{Q}_B^\top - Q_B^\top)_j|_1 \leq O((p - q)\varepsilon_P).$$

Here we are using the fact that $(\hat{Q}_B - Q_B)\vec{1} = 0$, so we can subtract $\min \Pi_C^\top P \Pi_B$ from all the entries of
$\Pi_C^\top P \Pi_B$ without influencing the result. That is also why we normalize the vectors $Q_B$ and $\hat{Q}_B$ (note that
it is similar to computing the "average degree" instead of the degree in the stochastic block model case).

Finally, $|(G_{C,B}\hat{Q}_B^\top)_{i,j} - (\Pi_C^\top P \Pi_B \hat{Q}_B^\top)_{i,j}|$ are small by standard concentration bounds (and the differences
are of lower order). Combining these we know $|\hat{F}_C(j, i) - F_C(j, i)| \leq O((p - q)\varepsilon_P)$.

Similarly, we can show $|\hat{P}_{i,j} - P_{i,j}| \leq O((p - q)\varepsilon_P)$ here. This means that the average of the diagonals of $\hat{P}$ is
$O((p - q)\varepsilon_P)$ close to $p$, and the average of the off-diagonals of $\hat{P}$ is $O((p - q)\varepsilon_P)$ close to $q$.

The entries of $\hat{F}_C$ are close to the entries of $F_C$, and we know $\xi = \Omega(\varepsilon_P)$; therefore all the entries in $F_C$ that
are larger than $q + (p - q)\xi$ must be found by our algorithm, and none of the entries that are smaller
than $q + (p - q)\xi/2$ will be called large by our algorithm. Finally, the result follows because
$\Pi(i,c) = (F_C(c,i) - q)/(p - q)$. □



B Tensor Power Method Analysis 

In this section, we leverage the perturbation analysis of Anandkumar et al. [4]. However, we obtain
stronger guarantees here through two modifications: (1) we modify the tensor deflation process in the robust
power method in Procedure 2. Rather than a fixed deflation step after obtaining an estimate of the eigenvalue-eigenvector
pair, in this paper we deflate adaptively depending on the current estimate, and (2) rather than
selecting random initialization vectors, as in [4], we initialize with vectors obtained from the adjacency matrix.
In Section B.1, we establish guarantees under "good" initialization vectors. This involves improved error
bounds for the modified deflation procedure provided in Section B.2. In Section C.5, we establish that under
the Dirichlet distribution (for small $\alpha_0$), we obtain "good" initialization vectors.



B.l Analysis under good initialization vectors 

We now show that when "good" initialization vectors are input to the tensor power method in Procedure 2, we
obtain good estimates of the eigen-pairs under an appropriate choice of the number of iterations $N$ and spectral norm
$\varepsilon$ of the tensor perturbation.

Let $T = \sum_{i \in [k]} \lambda_i v_i^{\otimes 3}$, where the $v_i$ are orthonormal vectors and $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_k$. Let $\tilde{T} = T + E$ be the
perturbed tensor with $\|E\| \leq \varepsilon$. Recall that $N$ denotes the number of iterations of the tensor power method.
We call an initialization vector $u$ $(\gamma, R_0)$-good if there exists $v_i$ such that $\langle u, v_i \rangle > R_0$ and
$|\langle u, v_i \rangle| - \max_{j<i} |\langle u, v_j \rangle| > \gamma |\langle u, v_i \rangle|$. Choose $\gamma = 1/100$.
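
The definition of a $(\gamma, R_0)$-good vector can be written as a direct check. The sketch below follows the condition exactly as stated above, with the maximum taken over indices $j < i$. It assumes the orthonormal vectors $v_i$ are available as the columns of a matrix `V`, which is only the case in a simulation, and the function name `is_good_initializer` is hypothetical.

```python
import numpy as np

def is_good_initializer(u, V, gamma, R0):
    """Return True if u is (gamma, R0)-good with respect to the columns of V."""
    c = V.T @ u                                    # correlations <u, v_i>
    for i in range(len(c)):
        if c[i] <= R0:
            continue                               # correlation with v_i is too small
        prev = np.max(np.abs(c[:i])) if i > 0 else 0.0
        if abs(c[i]) - prev > gamma * abs(c[i]):   # separation from earlier components
            return True
    return False

V = np.eye(3)
u = np.array([0.9, 0.3, 0.3])
u = u / np.linalg.norm(u)
good = is_good_initializer(u, V, gamma=0.01, R0=0.5)
```

A vector that is spread evenly over all $v_i$ fails the check once $R_0$ exceeds its largest correlation, which is why the analysis needs initial vectors with a pronounced component along one $v_i$.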

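Theorem B.1. There exist universal constants $C_1, C_2 > 0$ such that the following holds. Let

$$\varepsilon \leq C_1 \cdot \lambda_{\min} R_0^2, \qquad N \geq C_2 \cdot \left(\log(k) + \log\log\left(\frac{\lambda_{\max}}{\varepsilon}\right)\right). \qquad (48)$$

Assume there is at least one good initialization vector corresponding to each $v_i$, $i \in [k]$. The parameter $\xi$ for
choosing deflation vectors in each iteration of the tensor power method in Procedure 2 is chosen as $\xi \geq 25\varepsilon$.
We obtain eigenvalue-eigenvector pairs $(\hat{\lambda}_1, \hat{v}_1), (\hat{\lambda}_2, \hat{v}_2), \dots, (\hat{\lambda}_k, \hat{v}_k)$ such that there exists a permutation $\pi$
on $[k]$ with

$$\|v_{\pi(j)} - \hat{v}_j\| \leq 8\varepsilon/\lambda_{\pi(j)}, \qquad |\lambda_{\pi(j)} - \hat{\lambda}_j| \leq 5\varepsilon, \qquad \forall j \in [k],$$

and

$$\left\| T - \sum_{j=1}^{k} \hat{\lambda}_j \hat{v}_j^{\otimes 3} \right\| \leq 55\varepsilon.$$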



Proof: The proof is along the lines of the proof of [4, Thm. 5.1], but here we consider the modified deflation
procedure, which improves the condition on $\varepsilon$ in (48). We provide the full proof below for completeness.

We prove by induction on $i$, the number of eigenpairs estimated so far by Procedure 2. Assume that
there exists a permutation $\pi$ on $[k]$ such that the following assertions hold.



Xj\ < 12e. 



1. For all j < i, \\v^^j) - Vj\\ < 8e/A^(j) and |A^(j) 

2. D{u, i) is the set of deflated vectors given current estimate of the power method is u e 5'^"^: 

D{u,i;^) := {j :\Xiei\>^}n\i], 

where 6i := {u,Vi). 

3. The error tensor 



satisfies 



E 



r,«i3 



E 



A 



,,«i3 



U) 



< 56e, Vue5'=-i; (49) 
||^i+i,„(/,u,u)|| < 2e, Vw e 5*=-^ s.t. 3j>i + l. {u^v^^j)f > 1 - (168e/A^(,))^ (50) 

We take i as the base case, so we can ignore the first assertion, and just observe that for i = 0, 
D{u, 0; = and thus 

k 

j=i 

We have \\Ei\\ = \\E\\ = e, and therefore the second assertion holds. 

Now fix some i e [k], and assume as the inductive hypothesis. The power iterations now take a subset of 
j e [i] for deflation, depending on the current estimate. Set 



Ci := min{(56 • 9 • 102)-\ (100 • 168)-\ A' from Lemma B.2 with A = 1/50} . 



(51) 



For all good initialization vectors which are 7-separated relative to niay^)-, we have (i) \9^^^ q| > i?o, and 
(ii) that by in [4, Lemma B.4] (using e/p 2e, n := 1, and i* := 7r(jmax), and providing C2), 

(notice by definition that 7 > 1/100 implies 70 > 1 — 1/(1 + 7) > 1/101, thus it follows from the bounds on 
the other quantities that e = 2pe < 56Ci • Amin-Ro < 2(i+'8k) ' ' ^i'fi necessary). Therefore ^jv := ^ 
must satisfy 



fi{6N,6N,6N) = mji^fi{e^j^\eP,eP) > maxA,(,-) 



5e = KUrn.^) - 5e. 



On the other hand, by the triangle inequality, 

Ti{9N,0N,0N) < E'^'^(j)^^(i).-'V + \Ei{9N,9N,9N)\ 



j>i 



j>i 

< Ku')\0wU'),n\ + 56e 
where $j^* := \arg\max_{j \geq i} \lambda_{\pi(j)} |\theta_{\pi(j),N}|$. Therefore
5e- 56e > -A



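Squaring both sides and using the fact that $\theta_{\pi(j^*),N}^2 + \theta_{\pi(j),N}^2 \le 1$ for any $j \ne j^*$,

$$(\lambda_{\pi(j^*)}\theta_{\pi(j^*),N})^2 \ge \frac{16}{25}(\lambda_{\pi(j_{\max})}\theta_{\pi(j^*),N})^2 + \frac{16}{25}(\lambda_{\pi(j_{\max})}\theta_{\pi(j),N})^2 \ge \frac{16}{25}(\lambda_{\pi(j^*)}\theta_{\pi(j^*),N})^2 + \frac{16}{25}(\lambda_{\pi(j)}\theta_{\pi(j),N})^2,$$

which in turn implies

$$\lambda_{\pi(j)}|\theta_{\pi(j),N}| \le \frac{3}{4}\lambda_{\pi(j^*)}|\theta_{\pi(j^*),N}|, \quad j \ne j^*.$$

This means that $\theta_N$ is $(1/4)$-separated relative to $\pi(j^*)$. Also, observe that

$$|\theta_{\pi(j^*),N}| \ge \frac{4}{5}\cdot\frac{\lambda_{\pi(j_{\max})}}{\lambda_{\pi(j^*)}} \ge \frac{4}{5}, \qquad \frac{\lambda_{\pi(j_{\max})}}{\lambda_{\pi(j^*)}} \le \frac{5}{4}.$$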



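Therefore by [4, Lemma B.4] (using $\tilde\epsilon/p := 2\epsilon$, $\gamma := 1/4$, and $\kappa := 5/4$), executing another $N$ power iterations starting from $\theta_N$ gives a vector $\hat\theta$ that satisfies

$$\|\hat\theta - v_{\pi(j^*)}\| \le \frac{8\epsilon}{\lambda_{\pi(j^*)}}, \qquad |\hat\lambda - \lambda_{\pi(j^*)}| \le 5\epsilon.$$

Since $\hat v_i = \hat\theta$ and $\hat\lambda_i = \hat\lambda$, the first assertion of the inductive hypothesis is satisfied, as we can modify the permutation $\pi$ by swapping $\pi(i)$ and $\pi(j^*)$ without affecting the values of $\{\pi(j) : j \le i-1\}$ (recall $j^* \ge i$).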

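We now argue that $\tilde E_{i+1,u}$ has the required properties to complete the inductive step. By Lemma B.2 (using $\tilde\epsilon := 5\epsilon$, $\xi = 5\tilde\epsilon = 25\epsilon$ and $\Delta := 1/50$, the latter providing one upper bound on $C_1$ as per (51)), we have for any unit vector $u \in S^{k-1}$,

$$\Big\| \Big( \sum_{j \le i} \big( \lambda_{\pi(j)} v_{\pi(j)}^{\otimes 3} - \hat\lambda_j \hat v_j^{\otimes 3} \big) \Big)(I,u,u) \Big\| \le \Big( 1/50 + 100 \sum_{j=1}^{i} (u^\top v_{\pi(j)})^2 \Big)^{1/2} 5\epsilon \le 55\epsilon. \tag{52}$$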



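Therefore by the triangle inequality,

$$\|\tilde E_{i+1}(I,u,u)\| \le \|E(I,u,u)\| + \Big\| \sum_{j \le i} \big( \lambda_{\pi(j)} v_{\pi(j)}^{\otimes 3} - \hat\lambda_j \hat v_j^{\otimes 3} \big)(I,u,u) \Big\| \le 56\epsilon.$$

Thus the bound (49) holds.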

To prove that (50) holds, take any unit vector $u \in S^{k-1}$ such that there exists $j' \ge i+1$ with $(u^\top v_{\pi(j')})^2 \ge 1 - (168\epsilon/\lambda_{\pi(j')})^2$. We have (via the second bound on $C_1$ in (51) and the corresponding assumed bound $\epsilon \le C_1 \cdot \lambda_{\min} R_0^2$)

$$100 \sum_{j=1}^{i} (u^\top v_{\pi(j)})^2 \le 100\big(1 - (u^\top v_{\pi(j')})^2\big) \le 100 \Big( \frac{168\epsilon}{\lambda_{\pi(j')}} \Big)^2 \le \frac{1}{50},$$

and therefore

$$\Big( 1/50 + 100 \sum_{j=1}^{i} (u^\top v_{\pi(j)})^2 \Big)^{1/2} 5\epsilon \le (1/50 + 1/50)^{1/2}\, 5\epsilon \le \epsilon.$$

By the triangle inequality, we have $\|\tilde E_{i+1}(I,u,u)\| \le 2\epsilon$. Therefore (50) holds, so the second assertion of the inductive hypothesis holds. We conclude that by the induction principle, there exists a permutation $\pi$ such that the two assertions hold for $i = k$. From the last induction step ($i = k$), it is also clear from (52) that $\|T - \sum_{j=1}^{k} \hat\lambda_j \hat v_j^{\otimes 3}\| \le 55\epsilon$. This completes the proof of the theorem. □
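The numerical constants used in the induction can be checked by direct arithmetic. The following Python sketch is only an illustration: the helper name `deflation_factor` is ours, and it assumes the bound (52) has the form $(\Delta + 100\sum_j (u^\top v_{\pi(j)})^2)^{1/2} \cdot 5\epsilon$ with $\Delta = 1/50$, as read off from the proof above.

```python
import math


def deflation_factor(delta: float, overlap: float) -> float:
    """Return sqrt(delta + 100 * overlap), the multiplier of 5*eps in (52).

    `overlap` stands for sum_j (u^T v_{pi(j)})^2, which is at most 1
    for a unit vector u and orthonormal v's.
    """
    return math.sqrt(delta + 100.0 * overlap)


# Worst case for (49): overlap <= 1, so the deflation error is at most
# 5 * sqrt(1/50 + 100) * eps, and adding ||E|| <= eps must stay below 56 * eps.
worst = 5.0 * deflation_factor(1 / 50, 1.0)

# Case (50): 100 * overlap <= 1/50, so the factor is sqrt(1/25) = 1/5
# and the deflation error is at most eps.
near = 5.0 * deflation_factor(1 / 50, (1 / 50) / 100)

# Second bound on C_1 in (51): eps / lambda <= 1 / (100 * 168),
# which must give 100 * (168 * eps / lambda)^2 <= 1/50.
ratio = 1 / (100 * 168)
overlap_bound = 100 * (168 * ratio) ** 2

print(worst, near, overlap_bound)
```

The three printed quantities correspond, in order, to the multiplier in (52), the multiplier in the proof of (50), and the overlap bound implied by the second term of (51).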
B.2 Deflation Analysis

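Lemma B.2 (Deflation analysis). Let $\tilde\epsilon > 0$ and let $\{v_1, \ldots, v_k\}$ be an orthonormal basis for $\mathbb{R}^k$ and $\lambda_i > 0$ for $i \in [k]$. Let $\{\hat v_1, \ldots, \hat v_k\} \in \mathbb{R}^k$ be a set of unit vectors and $\hat\lambda_i > 0$. Define the third order tensor $\mathcal{E}_i$ such that

$$\mathcal{E}_i := \lambda_i v_i^{\otimes 3} - \hat\lambda_i \hat v_i^{\otimes 3}, \quad \forall i \in [k].$$

For some $t \in [k]$ and a unit vector $u \in S^{k-1}$ such that $u = \sum_{i \in [k]} \theta_i v_i$ and $\hat\theta_i := \langle u, \hat v_i \rangle$, if we have for $i \in [t]$,

$$|\hat\lambda_i \hat\theta_i| \ge \xi \ge 5\tilde\epsilon, \qquad |\hat\lambda_i - \lambda_i| \le \tilde\epsilon, \qquad \|\hat v_i - v_i\| \le \min\{\sqrt{2},\ 2\tilde\epsilon/\lambda_i\},$$

then the following holds:

$$\Big\| \sum_{i=1}^{t} \mathcal{E}_i(I,u,u) \Big\|_2^2 \le \Big( 4(5 + 11\tilde\epsilon/\lambda_{\min})^2 + 128(1 + \tilde\epsilon/\lambda_{\min})^2 (\tilde\epsilon/\lambda_{\min})^2 \Big) \tilde\epsilon^2 \sum_{i=1}^{t} \theta_i^2 + 64(1 + \tilde\epsilon/\lambda_{\min})^2 \tilde\epsilon^2 + 2048(1 + \tilde\epsilon/\lambda_{\min})^2 \tilde\epsilon^2.$$

In particular, for any $\Delta \in (0,1)$, there exists a constant $\Delta' > 0$ (depending only on $\Delta$) such that $\tilde\epsilon \le \Delta' \lambda_{\min}$ implies

$$\Big\| \sum_{i=1}^{t} \mathcal{E}_i(I,u,u) \Big\|_2^2 \le \Big( \Delta + 100 \sum_{i=1}^{t} \theta_i^2 \Big) \tilde\epsilon^2.$$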

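Proof: The proof follows the lines of the deflation analysis in [4, Lemma B.5], but we improve the bounds based on additional properties of the vector $u$. From [4], we have that for all $i \in [t]$, and any unit vector $u$,

$$\Big\| \sum_{i=1}^{t} \mathcal{E}_i(I,u,u) \Big\|_2^2 \le \Big( 4(5 + 11\tilde\epsilon/\lambda_{\min})^2 + 128(1 + \tilde\epsilon/\lambda_{\min})^2 (\tilde\epsilon/\lambda_{\min})^2 \Big) \tilde\epsilon^2 \sum_{i=1}^{t} \theta_i^2 + 64(1 + \tilde\epsilon/\lambda_{\min})^2 \tilde\epsilon^2 \sum_{i=1}^{t} (\tilde\epsilon/\lambda_i)^2 + 2048(1 + \tilde\epsilon/\lambda_{\min})^2 \tilde\epsilon^2 \Big( \sum_{i=1}^{t} (\tilde\epsilon/\lambda_i)^3 \Big)^2. \tag{53}$$

Let $\hat\lambda_i = \lambda_i + \delta_i$ and $\hat\theta_i = \theta_i + \beta_i$. We have $|\delta_i| \le \tilde\epsilon$ and $|\beta_i| \le 2\tilde\epsilon/\lambda_i$, and that $|\hat\lambda_i \hat\theta_i| \ge \xi$. Then

$$\big| |\hat\lambda_i \hat\theta_i| - |\lambda_i \theta_i| \big| \le |\hat\lambda_i \hat\theta_i - \lambda_i \theta_i| = |(\lambda_i + \delta_i)(\theta_i + \beta_i) - \lambda_i \theta_i| = |\delta_i \theta_i + \lambda_i \beta_i + \delta_i \beta_i| \le 4\tilde\epsilon.$$

Thus, we have that $|\lambda_i \theta_i| \ge 5\tilde\epsilon - 4\tilde\epsilon = \tilde\epsilon$. Thus $\sum_{i=1}^{t} \tilde\epsilon^2/\lambda_i^2 \le \sum_i \theta_i^2 \le 1$. Substituting in (53), we have the result. □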
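The key step in the proof above is the perturbation bound $|\hat\lambda_i\hat\theta_i - \lambda_i\theta_i| \le 4\tilde\epsilon$. A minimal randomized sanity check is sketched below; the function name `max_product_gap` and the sampling ranges are our own choices, and the check additionally assumes $\tilde\epsilon/\lambda_i \le 1/2$, which is what makes the cross term $|\delta_i\beta_i| \le \tilde\epsilon$.

```python
import random


def max_product_gap(trials: int = 100_000, seed: int = 0) -> float:
    """Return the largest observed |lam_hat*theta_hat - lam*theta| / eps.

    Samples are drawn under the lemma's hypotheses:
    |delta| <= eps, |beta| <= 2*eps/lam, |theta| <= 1,
    plus the extra assumption eps/lam <= 1/2 (labelled above).
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        lam = rng.uniform(0.1, 10.0)
        eps = rng.uniform(0.01 * lam, 0.5 * lam)
        theta = rng.uniform(-1.0, 1.0)
        delta = rng.uniform(-eps, eps)
        beta = rng.uniform(-2.0 * eps / lam, 2.0 * eps / lam)
        gap = abs((lam + delta) * (theta + beta) - lam * theta)
        worst = max(worst, gap / eps)
    return worst


print(max_product_gap())
```

The returned ratio never exceeds 4, since $|\delta\theta| \le \tilde\epsilon$, $|\lambda\beta| \le 2\tilde\epsilon$ and $|\delta\beta| \le 2\tilde\epsilon^2/\lambda \le \tilde\epsilon$.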



C Concentration Bounds 

C.1 Main Result: Third Moment Tensor Perturbation Bound

Notation: Let $\|T\|$ denote the spectral norm for a tensor $T$ (or in special cases a matrix or a vector). Let $\|M\|_F$ denote the Frobenius norm. Let $|M|_1$ denote the operator $\ell_1$ norm, i.e., the maximum $\ell_1$ norm of its columns, and $\|M\|_\infty$ the maximum $\ell_1$ norm of its rows. Let $\kappa(M)$ denote the condition number, i.e., $\|M\| / \sigma_{\min}(M)$.

Unless otherwise specified, all statements hold with probability $1 - \delta$ for a sufficiently small $\delta > 0$. Moreover, the probability space is the product space of the node communities, drawn from the Dirichlet distribution, and the edge variables, drawn from the Bernoulli distribution $\mathrm{Ber}(P_{ij})$ given that the communities of the two end points are $i$ and $j$. Let $\tilde O$ denote $O(\cdot)$ up to poly-log factors. We write $\Pi \sim \mathrm{Dir}(\alpha)$ to mean $\pi_i \overset{\text{i.i.d.}}{\sim} \mathrm{Dir}(\alpha)$, for $i \in V$.

We now provide the main result that the third-order whitened tensor computed from samples concentrates. Recall that $T^{\alpha_0}_{Y \to \{A,B,C\}}$ denotes the third order moment computed using edges from partition $Y$ to partitions $A, B, C$ in (12). $\hat W_A$, $\hat W_B \hat R_{AB}$, $\hat W_C \hat R_{AC}$ are the whitening matrices defined in (21). The corresponding whitening matrices $W_A$, $W_B R_{AB}$, $W_C R_{AC}$ for the exact third order moment tensor $\mathbb{E}[T^{\alpha_0}_{Y \to \{A,B,C\}} \mid \Pi]$ will be defined later. Recall that $\rho$ is defined in (22) as

ao + 1 
P ■■= — • 



Given 6 G (0, 1), throughout assume that 



n = n{pHog''^). (54) 



Theorem C.1 (Perturbation of whitened tensor). When the partitions $A, B, C, X, Y$ satisfy (54), we have with probability $1 - 100\delta$,



T?^{A.,B,C}i^A'^BRAB,WcRAc)-nT7^[A,B,C}iWA,WB,W^^^^ 

^^f («o + l)V(max,(Pa)0 L ^ f P^^^^2 kY^'\ , /bifc ' 



= 0(-^--^], (55) 



C 

- V Ctmax 

C.2 Whitening Matrix Perturbations 

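Consider the rank-$k$ SVD $|X|^{-1/2}(G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} = \hat U_A \hat D_A \hat V_A^\top$. The whitening matrix is given by $\hat W_A := \hat U_A \hat D_A^{-1}$, and thus $|X|^{-1} \hat W_A^\top (G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} (G^{\alpha_0}_{X,A})_{k\text{-svd}} \hat W_A = I$. Now consider the singular value decomposition of

$$|X|^{-1} \hat W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] \cdot \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi]\, \hat W_A = \Phi_A \tilde D_A \Phi_A^\top.$$

$\hat W_A$ does not whiten the exact moments in general. On the other hand, consider

$$W_A := \hat W_A \Phi_A \tilde D_A^{-1/2} \Phi_A^\top.$$

Observe that $W_A$ whitens $|X|^{-1/2}\, \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi]$:

$$|X|^{-1} W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi]\, \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi]\, W_A = (\Phi_A \tilde D_A^{-1/2} \Phi_A^\top)\, \Phi_A \tilde D_A \Phi_A^\top\, (\Phi_A \tilde D_A^{-1/2} \Phi_A^\top) = I.$$

Now the ranges of $W_A$ and $\hat W_A$ may differ and we control the perturbations below.
Also note that $\hat R_{AB}$, $R_{AB}$ are given by

$$\hat R_{AB} := |X|^{-1} \hat W_B^\top (G^{\alpha_0}_{X,B})^\top_{k\text{-svd}} (G^{\alpha_0}_{X,A})_{k\text{-svd}} \hat W_A. \tag{56}$$

$$R_{AB} := |X|^{-1} W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] \cdot \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi] \cdot W_A. \tag{57}$$

Recall $\epsilon_G$ is given by (61), and $\sigma_{\min}(\mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi])$ is given in (C.8) and $|A| = |B| = |X| = n$.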
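To make the construction $W_A := \hat W_A \Phi_A \tilde D_A^{-1/2} \Phi_A^\top$ concrete, here is a small stdlib-only Python sketch on $2 \times 2$ Gram matrices. It is an illustration under simplifying assumptions, not the paper's procedure: `m_hat` plays the role of the empirical second moment $|X|^{-1}(G)^\top_{k\text{-svd}}(G)_{k\text{-svd}}$, `m_exact` the role of the exact one, and all function names are hypothetical. Whitening via the eigendecomposition of the Gram matrix is equivalent to $\hat U_A \hat D_A^{-1}$ from the SVD.

```python
import math


def sym_eig_2x2(m):
    """Eigenvalues (descending) and orthonormal eigenvectors of a symmetric 2x2 matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    vals = [mean + rad, mean - rad]
    if abs(b) < 1e-15:  # already diagonal
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vecs = []
        for lam in vals:
            x, y = b, lam - a  # solves (M - lam*I) v = 0
            n = math.hypot(x, y)
            vecs.append((x / n, y / n))
    return vals, vecs


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def whiten(m):
    """W = U diag(lambda)^{-1/2}, so that W^T m W = I for positive definite m."""
    vals, vecs = sym_eig_2x2(m)
    return [[vecs[j][i] / math.sqrt(vals[j]) for j in range(2)] for i in range(2)]


def inv_sqrt(s):
    """S^{-1/2} = Phi D^{-1/2} Phi^T for a symmetric positive definite 2x2 matrix."""
    vals, vecs = sym_eig_2x2(s)
    return [[sum(vecs[j][r] * vecs[j][c] / math.sqrt(vals[j]) for j in range(2))
             for c in range(2)] for r in range(2)]


def corrected_whitener(m_exact, m_hat):
    """W = W_hat * (W_hat^T m_exact W_hat)^{-1/2}, which whitens m_exact."""
    w_hat = whiten(m_hat)
    s = matmul(transpose(w_hat), matmul(m_exact, w_hat))
    return matmul(w_hat, inv_sqrt(s))


def identity_gap(m):
    """Largest entrywise deviation of m from the identity."""
    return max(abs(m[i][j] - (1.0 if i == j else 0.0))
               for i in range(2) for j in range(2))
```

For example, with `m_exact = [[2.0, 0.6], [0.6, 1.0]]` and `m_hat = [[2.1, 0.5], [0.5, 0.9]]`, `whiten(m_hat)` whitens `m_hat` but not `m_exact`, while `corrected_whitener(m_exact, m_hat)` whitens `m_exact` and shares its column space with the empirical whitener.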
Lemma C.2 (Whitening matrix perturbations). With probability $1 - \delta$,

ew^ := II Di&g{ay/^Fj{WA - Wa)\\ = O I 

:= II Dia.g{ay/^F^{WBRAB - WbRab)\\ =0\^ 




(59) 



(58) 



Thus, with probability 1 — 66, 




• (1 + £l + £2 + £3) 



(60) 



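where $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ are given by (67) and (68).

Remark: Note that when partitions $X, A$ satisfy (54), $\epsilon_1, \epsilon_2, \epsilon_3$ are small. When $P$ is well conditioned and $\hat\alpha_{\min} = \hat\alpha_{\max} = 1/k$, we have $\epsilon_{W_A}, \epsilon_{\tilde W_B} = O(k/\sqrt{n})$.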



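Proof: Using the fact that $W_A = \hat W_A \Phi_A \tilde D_A^{-1/2} \Phi_A^\top$, or $\hat W_A = W_A \Phi_A \tilde D_A^{1/2} \Phi_A^\top$, we have that

$$\|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top (\hat W_A - W_A)\| \le \|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top W_A (I - \Phi_A \tilde D_A^{1/2} \Phi_A^\top)\|$$
$$= \|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top W_A (I - \tilde D_A^{1/2})\|$$
$$\le \|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top W_A (I - \tilde D_A^{1/2})(I + \tilde D_A^{1/2})\|$$
$$\le \|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top W_A\| \cdot \|I - \tilde D_A\|,$$

using the fact that $\tilde D_A$ is a diagonal matrix.

Now note that $W_A$ whitens $|X|^{-1/2}\, \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi] = |X|^{-1/2} F_A\, \mathrm{Diag}(\hat\alpha^{1/2})\, \Psi_X$, where $\Psi_X$ is defined in (66). Further it is shown in Lemma C.8 that $\Psi_X$ satisfies the bound (67) with probability $1 - \delta$. Since $\epsilon_1 \ll 1$ when $X, A$ satisfy (54), we have that $|X|^{-1/2} \Psi_X$ has singular values around 1. Since $W_A$ whitens $|X|^{-1/2}\, \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi]$, we have

$$|X|^{-1} W_A^\top F_A\, \mathrm{Diag}(\hat\alpha^{1/2})\, \Psi_X \Psi_X^\top\, \mathrm{Diag}(\hat\alpha^{1/2})\, F_A^\top W_A = I.$$

Thus, with probability $1 - \delta$,

$$\|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top W_A\| = O((1 - \epsilon_1)^{-1/2}).$$

Let $\mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi] = (G^{\alpha_0}_{X,A})_{k\text{-svd}} + \Delta$. We have

$$\|I - \tilde D_A\| = \|I - \Phi_A \tilde D_A \Phi_A^\top\|$$
$$= \|I - |X|^{-1} \hat W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] \cdot \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi]\, \hat W_A\|$$
$$= O\big( |X|^{-1} \|\hat W_A^\top ( \Delta^\top (G^{\alpha_0}_{X,A})_{k\text{-svd}} + \Delta (G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} ) \hat W_A\| \big)$$
$$= O\big( |X|^{-1/2} \|\hat W_A^\top \Delta^\top \hat V_A + \hat V_A^\top \Delta \hat W_A\| \big)$$
$$= O\big( |X|^{-1/2} \|\hat W_A\| \|\Delta\| \big)$$
$$= O\big( |X|^{-1/2} \|W_A\| \epsilon_G \big),$$

since $\|\Delta\| \le \epsilon_G + \sigma_{k+1}(G^{\alpha_0}_{X,A}) \le 2\epsilon_G$, using Weyl's theorem for singular value perturbation and the fact that $\epsilon_G \cdot \|W_A\| \ll 1$ and $\|W_A\| = |X|^{1/2} / \sigma_{\min}(\mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi])$.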

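We now consider the perturbation of $\hat W_B \hat R_{AB}$. By definition, we have that

$$\mathbb{E}[G^{\alpha_0}_{X,B} \mid \Pi] \cdot W_B R_{AB} = \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi] \cdot W_A,$$

and

$$\|W_B R_{AB}\| = |X|^{1/2}\, \sigma_{\min}(\mathbb{E}[G^{\alpha_0}_{X,B} \mid \Pi])^{-1}.$$

Along the lines of the previous derivation for $\epsilon_{W_A}$, let

$$|X|^{-1} (\hat W_B \hat R_{AB})^\top \cdot \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] \cdot \mathbb{E}[G^{\alpha_0}_{X,B} \mid \Pi]\, \hat W_B \hat R_{AB} = \Phi_B \tilde D_B \Phi_B^\top.$$

Again using the fact that $|X|^{-1} \Psi_X \Psi_X^\top \approx I$, we have

$$\|\mathrm{Diag}(\hat\alpha)^{1/2} F_B^\top W_B R_{AB}\| \approx \|\mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top W_A\|,$$

and the rest of the proof follows. □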

C.3 Proving the tensor concentration bound 

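Proof of Theorem C.1: In the tensor $T^{\alpha_0}$ in (12), the first term is

$$(\alpha_0 + 1)(\alpha_0 + 2) \sum_{i \in Y} \big( G_{i,A}^\top \otimes G_{i,B}^\top \otimes G_{i,C}^\top \big).$$

We claim that this term dominates in the perturbation analysis since the mean vector perturbation is of lower order. We now consider the perturbation of the whitened tensor

$$\Lambda_0 = \frac{1}{|Y|} \sum_{i \in Y} \big( (\hat W_A^\top G_{i,A}^\top) \otimes (\hat R_{AB}^\top \hat W_B^\top G_{i,B}^\top) \otimes (\hat R_{AC}^\top \hat W_C^\top G_{i,C}^\top) \big).$$

We show that this tensor is close to the corresponding term in the expectation in three steps.
First we show it is close to

$$\Lambda_1 = \frac{1}{|Y|} \sum_{i \in Y} \big( (\hat W_A^\top F_A \pi_i) \otimes (\hat R_{AB}^\top \hat W_B^\top F_B \pi_i) \otimes (\hat R_{AC}^\top \hat W_C^\top F_C \pi_i) \big).$$

Then this tensor is close to the expectation over $\Pi_Y$:

$$\Lambda_2 = \mathbb{E}_{\pi \sim \mathrm{Dir}(\alpha)} \big( (\hat W_A^\top F_A \pi) \otimes (\hat R_{AB}^\top \hat W_B^\top F_B \pi) \otimes (\hat R_{AC}^\top \hat W_C^\top F_C \pi) \big).$$

Finally we replace the estimated whitening matrix $\hat W_A$ with $W_A$:

$$\Lambda_3 = \mathbb{E}_{\pi \sim \mathrm{Dir}(\alpha)} \big( (W_A^\top F_A \pi) \otimes (\tilde W_B^\top F_B \pi) \otimes (\tilde W_C^\top F_C \pi) \big).$$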

For Ao — Ai, the dominant term in the perturbation bound (assuming partitions A, B, C, X, Y are of size 
n) is (since for any rank 1 tensor, t; iv\\ = \\u\\ ■ \\v\\ ■ \\w\\), 



1 



E {^JgIa - WJFatt?! 



ieY 



^, 1 — 1 (ao + l)(max^(PS)i) j n\ 
with probability $1 - 13\delta$ (Lemma C.3). Since there are 7 terms in the third order tensor $T^{\alpha_0}$, we have the bound with probability $1 - 91\delta$.

For Ai — A2, since W^-Fa Diag(a)^/^ has spectral norm almost 1, by Lemma C.5 the spectral norm of 
the perturbation is at most 

|WAfADiag(a)V2 

< 0{l/aminVn\/\ogn/6). 
For the final term A2 — A3, the dominating term is 

m - W^)F^ Diag(a)V2 \\A,\\ < || A3II < O ( («o + iVmax.(Pa). + s,+e, + es).!^] 

Putting all these together, the third term $\|\Lambda_2 - \Lambda_3\|$ dominates. We know with probability at least $1 - 100\delta$, the perturbation in the tensor is at most

^ -— (l+£i+£2 + £3)Wlog^ . 

□ 



— ^(Diag(S)-V27,.)®3 _ E,.Di,(„)(Diag(a)-V2^,)' 



Lemma C.3 (Concentration of sum of whitened vectors). Assuming all the partitions satisfy (54), with probability $1 - 7\delta$,



ieY 



0{y/\Y\^^ewA) 



ieY 



O 



= 



\J («o - 


h l)(max,(Pa)i) 




nin'^min (-^) 


•\/(ao - 


h l)(max,(Pa)i) 



and Smin = c 

0{k). Thus, when it is normalized with 1/|F| = 1/n, we have the bound as 0{k/n). 



• (1 + £2 + ss)y/logn/Sj , 

• (1 + £1 + £2 + e3)-\/log n/5 . 



Remark: Note that when P is well conditioned and a-^nin = S,„ax = 1/fc, we have the above bounds as 



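Proof: Note that $\hat W_A$ is computed using partition $X$ and $G_{i,A}$ is obtained from $i \in Y$. We have independence for edges across different partitions $X$ and $Y$. Let $\Xi_i := \hat W_A^\top (G_{i,A}^\top - F_A \pi_i)$. Applying matrix Bernstein's inequality to each of the variables, we have

$$\|\Xi_i\| \le \|\hat W_A\| \cdot \|G_{i,A}^\top - F_A \pi_i\| \le \|\hat W_A\| \sqrt{\|F_A\|_1},$$

from Lemma C.7. The variances are given by

$$\Big\| \sum_{i \in Y} \mathbb{E}[\Xi_i \Xi_i^\top \mid \Pi] \Big\| \le \Big\| \sum_{i \in Y} \hat W_A^\top\, \mathrm{Diag}(F_A \pi_i)\, \hat W_A \Big\| \le \|\hat W_A\|^2 \|F_Y\|_1 = O\Big( \frac{|Y| (\alpha_0 + 1)(\max_i (P\hat\alpha)_i)}{\hat\alpha_{\min}^2\, \sigma_{\min}^2(P)} \cdot (1 + \epsilon_2 + \epsilon_3) \Big),$$

with probability $1 - 2\delta$ from (64) and (65), and $\epsilon_2$, $\epsilon_3$ are given by (68). Similarly, $\| \sum_{i \in Y} \mathbb{E}[\Xi_i^\top \Xi_i \mid \Pi] \| \le \|\hat W_A\|^2 \|F_Y\|_1$. Thus, from matrix Bernstein's inequality, we have with probability $1 - 3\delta$

$$\Big\| \sum_{i \in Y} \Xi_i \Big\| = O\big( \|\hat W_A\| \sqrt{\max(\|F_A\|_1, \|F_X\|_1)} \big) = O\Big( \frac{\sqrt{(\alpha_0 + 1)(\max_i (P\hat\alpha)_i)}}{\hat\alpha_{\min}\, \sigma_{\min}(P)} \cdot (1 + \epsilon_2 + \epsilon_3) \sqrt{\log n/\delta} \Big).$$

Along similar lines, we have the result for $B$ and $C$, and also use the independence assumption on edges in various partitions. □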

We now show that not only the sum of whitened vectors concentrates, but that each individual whitened vector $\hat W_A^\top G_{i,A}^\top$ concentrates, when $A$ is large enough.

Lemma C.4 (Concentration of a random whitened vector). Conditioned on the matrix $\Pi$, with probability at least $1/4$,

-^(ao + l)(maxj(Pa)i) \ 



-1/2^ 



o 



Proof. We have 

\wIgIa - WlFATTi\ < \{Wa - WAfFAT^i\ + \wI{GIa - WIFa-k) 
The first term satisfies, with probability $1 - 3\delta$,

\\{Wj^Wj)FA^,\\<ew.K^J^ 



rt ( (qq + l)amaxv/ (max,(Pa)j) 

TJ^2 7^^ {1 +81+62+ S3) 

Now we bound the second term. Note that $G_{i,A}^\top$ is independent of $\hat W_A$, since they are related to disjoint subsets of edges. The whitened neighborhood vector can be viewed as a sum of vectors:

$$\hat W_A^\top G_{i,A}^\top = \sum_{j \in A} G_{i,j} (\hat W_A^\top)_j = \sum_{j \in A} G_{i,j} (\hat D_A^{-1} \hat U_A^\top)_j = \hat D_A^{-1} \sum_{j \in A} G_{i,j} (\hat U_A^\top)_j.$$

Conditioned on $\pi_i$ and $F_A$, the $G_{i,j}$ are Bernoulli variables with probability $(F_A \pi_i)_j$. The goal is to compute the variance of the sum, $\sum_{j \in A} (F_A \pi_i)_j \|(\hat U_A^\top)_j\|^2$, and then use Chebyshev's inequality.

By Wedin's theorem, we know the span of the columns of $\hat U_A$ is $O(\epsilon_G / \sigma_{\min}(G^{\alpha_0}_{X,A})) = O(\epsilon_{W_A})$ close to the span of the columns of $F_A$. The span of the columns of $F_A$ is the same as the span of the rows of $\Pi_A$. In particular, letting $\mathrm{Proj}_\Pi$ be the projection matrix onto the span of the rows of $\Pi_A$, we have

$$\|\hat U_A \hat U_A^\top - \mathrm{Proj}_\Pi\| \le O(\epsilon_{W_A}).$$

Using the spectral norm bound, we have the Frobenius norm 



UaUJ - Projn 
since they are rank k matrices. This implies that 



jeA 
Now



O-min(nA) 



o 



from Lemma C.8 

Now we can bound the variance of the vectors X^^g^ since the variance of Gij is bounded by 

{FA7Ti)j (its probability), and the variance of the vectors is at most 



jeA jeA 



Profnf + 2Y,{FA7Ti)j {\\{U^)j\\ - \\ProA 



< 2 > (FATTi)-i max 
tr; jeA 
jeA 

f\FA\i{ao + l) 



jeA 

|2 



jeA 



< O 



Now Chebyshev's inequality implies that with probability at least 1/4 (or any other constant),



jeA 



^c( ^FA\i{ao + l) 



nar, 



And thus, we have 



Wlid^A - FA-Ki) < 



|i=:4|i(ao + l) 



nan 



Wi 



< 



Combining the two terms, we have the result. 



□ 



Remark: This lemma is used to show that with some probability $\hat W_A^\top G_{i,A}^\top$ is a good initial vector for the tensor power method. In order to achieve that we need to combine this lemma with Lemma C.10. Notice that there are three events here: (1) $\epsilon_{W_A}$ is small and $|F_A|_1$ is concentrated around its expectation; (2) $\|\hat W_A^\top G_{i,A}^\top - W_A^\top F_A \pi_i\|$ is small; (3) $W_A^\top F_A \pi_i$ is a $(\gamma, r_0)$ good initial vector.

When all three events happen we can conclude that $\hat W_A^\top G_{i,A}^\top$ is $(\gamma - \frac{2\Delta}{r_0 - \Delta},\ r_0 - \Delta)$ good. Event (1) happens with high probability. Event (3) is only related to properties of $\Pi$. The proof above shows that, conditioned on Event (1) and any value of $\Pi$, the probability that Event (2) happens for index $i$ is always at least 1/4, therefore the intersection of the three events happens with significant probability.



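Lemma C.5. With probability $1 - \delta$,

$$\Big\| \frac{1}{|Y|} \sum_{i \in Y} (\mathrm{Diag}(\hat\alpha)^{-1/2} \pi_i)^{\otimes 3} - \mathbb{E}_{\pi \sim \mathrm{Dir}(\alpha)} (\mathrm{Diag}(\hat\alpha)^{-1/2} \pi)^{\otimes 3} \Big\| \le O\Big( \frac{1}{\hat\alpha_{\min} \sqrt{n}} \cdot \sqrt{\log n/\delta} \Big) = \tilde O\Big( \frac{1}{\hat\alpha_{\min} \sqrt{n}} \Big).$$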

Proof. The spectral norm of this tensor cannot be larger than the spectral norm of a $k \times k^2$ matrix that we obtain by "collapsing" the last two dimensions (by definitions of norms). Let $\phi_i = \mathrm{Diag}(\hat\alpha)^{-1/2} \pi_i$; the "collapsed" tensor is just the matrix $\phi_i (\phi_i \otimes \phi_i)^\top$ (here we view $\phi_i \otimes \phi_i$ as a vector in $\mathbb{R}^{k^2}$). We apply Matrix Bernstein on the matrices $Z_i = \phi_i (\phi_i \otimes \phi_i)^\top$.

Clearly, $\| \sum_{i \in Y} \mathbb{E}[Z_i Z_i^\top] \| \le |Y| \max \|\phi\|^4\, \| \mathbb{E}[\phi \phi^\top] \| \le |Y| \hat\alpha_{\min}^{-2}$ because $\| \mathbb{E}[\phi \phi^\top] \| \le 2$.

For the other variance term $\| \sum_{i \in Y} \mathbb{E}[Z_i^\top Z_i] \|$, we have $\| \sum_{i \in Y} \mathbb{E}[Z_i^\top Z_i] \| \le |Y| \hat\alpha_{\min}^{-1}\, \| \mathbb{E}[(\phi \otimes \phi)(\phi \otimes \phi)^\top] \|$. It remains to bound the norm of $\mathbb{E}[(\phi \otimes \phi)(\phi \otimes \phi)^\top]$. Suppose we have a variable $X = \sum_{i,j} A_{i,j} \phi(i) \phi(j)$; then by definition the spectral norm of $\mathbb{E}[(\phi \otimes \phi)(\phi \otimes \phi)^\top]$ is the maximum expectation of $X^2$ among all matrices $A$ such that $\|A\|_F = 1$.

We group the terms in $\mathbb{E}[X^2]$ according to the powers of the $\phi$ variables, and then bound the contribution of different groups of terms separately; $\mathbb{E}[X^2]$ is at most the sum of absolute values of the different groups.

terms: By properties of Dirichlet distribution we know E[0(i)^] = Q{a~^) < 0(a~J„). The 
coefficients in front of ^(i)'' is Af-, so we know 

k 

4'{i)^4'{j) terms: Ignore symmetries (there are constantly many so we lose at most a constant here) the 
coefficients are Ai^iAij. The expectation of 4'{'>-Y4'U) is 6(a(i)~^/^d(j)^/^). Therefore, the total contribution 
of these terms is bounded by 



4>{i)'^4'{3Y terms: the total number of such terms is O(fc^), the expectation of (f){i)'^ cp{j)^ is 0(1), so the 
Frobenius norm of that part of matrix is smaller than 0{k) (which implies |E[^^ j{Afj+2Ai iAj j)(t>{i)'^4'{j)'^]\ < 
Oik)). 

(t){ii)'^4>{i2)(f){h) terms: similarly, there are O(fc^) such terms, each one has expectation 9(d(i2)^/^a(i3)^/^). 
The Frobenius norm of this part of matrix is bounded by 



O / E «(*2)a(*3) < 0{Vk) Yl E "^^"^3 < 0{Vk). 

the rest: the sum is E[^^ Ap^gd(i)^/^Q;(j)^/^d(j3)^/^d(g)-^/^]. It is easy to break the bounds into 

the product of two sums Sp,g) then bound each one by Cauchy-Schwartz, the result is 1. 

Hence the variance term in Matrix Bernstein's inequality can be bounded by cr^ < 0{na^^^), each term 
has norm at most When < n wc know the variance term dominates and the spectral norm of 

the difference is at most 0(a^^„n^ -^/log n/d) with probability I — S. 

C.4 Concentration of adjacency matrix and other auxiliary lemmata

Let n := max(m, \X\). 

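Lemma C.6 (Concentration of $G^{\alpha_0}_{X,A}$). When $\pi_i \sim \mathrm{Dir}(\alpha)$, for $i \in V$, with probability $1 - 4\delta$,

$$\epsilon_G := \| G^{\alpha_0}_{X,A} - \mathbb{E}[(G^{\alpha_0}_{X,A}) \mid \Pi] \| = O\Big( \sqrt{(\alpha_0 + 1)\, n \cdot (\max_i (P\hat\alpha)_i)(1 + \epsilon_2) \log \frac{n}{\delta}} \Big). \tag{61}$$

Proof: From the definition of $G^{\alpha_0}_{X,A}$, we have

$$\epsilon_G \le \sqrt{\alpha_0 + 1}\, \| G_{X,A} - \mathbb{E}[G_{X,A} \mid \Pi] \| + (\sqrt{\alpha_0 + 1} - 1) \sqrt{|X|}\, \| \mu_{X,A} - \mathbb{E}[\mu_{X,A} \mid \Pi] \|.$$

We have concentration for $\mu_{X,A}$ and the adjacency submatrix $G_{X,A}$ from Lemma C.7. □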

We now provide concentration bounds for the adjacency sub-matrix $G_{X,A}$ from partition $X$ to $A$ and the corresponding mean vector. Recall that $\mathbb{E}[\mu_{X \to A} \mid F_A, \pi_X] = F_A \pi_X$ and $\mathbb{E}[\mu_{X \to A} \mid F_A] = F_A \hat\alpha$.

Lemma C.7 (Concentration of adjacency submatrices). When $\pi_i \sim \mathrm{Dir}(\alpha)$ for $i \in V$, with probability $1 - 2\delta$,

$$\| G_{X,A} - \mathbb{E}[G_{X,A} \mid \Pi] \| = O\Big( \sqrt{n \cdot (\max_i (P\hat\alpha)_i)(1 + \epsilon_2) \log \frac{n}{\delta}} \Big). \tag{62}$$



Ima - ]E[MA|n]|| = O (^y^ • (max(Pa)i)(l + £2) log ^ j , (63) 



where $\epsilon_2$ is given by (68).



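Proof: Recall $\mathbb{E}[G_{X,A} \mid \Pi] = F_A \Pi_X$ and $G_{A,X} = \mathrm{Ber}(F_A \Pi_X)$, where $\mathrm{Ber}(\cdot)$ denotes the Bernoulli random matrix with independent entries. Let

$$Z_i := (G_{i,A}^\top - F_A \pi_i) e_i^\top.$$

We have $G_{X,A}^\top - F_A \Pi_X = \sum_{i \in X} Z_i$. We apply matrix Bernstein's inequality.

We compute the variances $\sum_i \mathbb{E}[Z_i Z_i^\top \mid \Pi]$ and $\sum_i \mathbb{E}[Z_i^\top Z_i \mid \Pi]$. In $\sum_i \mathbb{E}[Z_i Z_i^\top \mid \Pi]$ only the diagonal terms are non-zero due to independence of the Bernoulli variables, and

$$\mathbb{E}[Z_i Z_i^\top \mid \Pi] \le \mathrm{Diag}(F_A \pi_i) \tag{64}$$

entry-wise. Thus,

$$\Big\| \sum_{i \in X} \mathbb{E}[Z_i Z_i^\top \mid \Pi] \Big\| \le \max_{a \in [k]} \sum_{i \in X, b \in [k]} F_A(a,b)\, \pi_i(b) = \max_{a \in [k]} \sum_{i \in X, b \in [k]} F_A(a,b)\, \Pi_X(b,i) \le \max_{c \in [k]} \sum_{i \in X, b \in [k]} P(c,b)\, \Pi_X(b,i) = \| P \Pi_X \|_\infty = \| F_X \|_1. \tag{65}$$

Similarly $\sum_{i \in X} \mathbb{E}[Z_i^\top Z_i] = \sum_{i \in X} \mathrm{Diag}(\mathbb{E}[\| G_{i,A}^\top - F_A \pi_i \|^2]) \le \| F_A \|_1$. From Lemma C.12, we have $\| F_X \|_1 = O(|X| \cdot (\max_i (P\hat\alpha)_i))$ when $|X|$ satisfies (54).

We now bound $\|Z_i\|$. First note that the entries in $G_{i,A}$ are independent and we can use the vector Bernstein's inequality to bound $\| G_{i,A} - F_A \pi_i \|$. We have $\max_{j \in A} |G_{i,j} - (F_A \pi_i)_j| \le 2$ and $\sum_j \mathbb{E}[G_{i,j} - (F_A \pi_i)_j]^2 \le \sum_j (F_A \pi_i)_j \le \| F_A \|_1$. Thus with probability $1 - \delta$, we have

$$\| G_{i,A} - F_A \pi_i \| \le (1 + \sqrt{8 \log(1/\delta)}) \sqrt{\| F_A \|_1} + \frac{8}{3} \log(1/\delta).$$

Thus, we have the bound that $\| \sum_i Z_i \| = O(\max(\sqrt{\| F_A \|_1}, \sqrt{\| F_X \|_1}))$. The concentration of the mean term follows from this result. □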

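We now provide spectral bounds on $\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi]$. Define

$$\psi_i := \mathrm{Diag}(\hat\alpha)^{-1/2} \big( \sqrt{\alpha_0 + 1}\, \pi_i - (\sqrt{\alpha_0 + 1} - 1) \hat\alpha \big). \tag{66}$$

Let $\Psi_X$ be the matrix with columns $\psi_i$, for $i \in X$. We have

$$\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] = F_A\, \mathrm{Diag}(\hat\alpha)^{1/2}\, \Psi_X,$$

from the definition of $\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi]$.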
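As a short check of the normalization in (66), the identity $\mathbb{E}[\psi_i \psi_i^\top] = I$ can be derived from the second moments of the Dirichlet distribution (Proposition C.14). The derivation below is a sketch; the shorthand $s := \sqrt{\alpha_0 + 1}$ is ours and does not appear in the paper.

```latex
\begin{align*}
\mathbb{E}[\pi\pi^\top]
  &= \frac{\alpha_0}{\alpha_0+1}\,\hat\alpha\hat\alpha^\top
     + \frac{1}{\alpha_0+1}\,\operatorname{Diag}(\hat\alpha), \\
\mathbb{E}\big[(s\pi-(s-1)\hat\alpha)(s\pi-(s-1)\hat\alpha)^\top\big]
  &= s^2\,\mathbb{E}[\pi\pi^\top] - 2s(s-1)\,\hat\alpha\hat\alpha^\top
     + (s-1)^2\,\hat\alpha\hat\alpha^\top \\
  &= \operatorname{Diag}(\hat\alpha)
     + (\alpha_0 - s^2 + 1)\,\hat\alpha\hat\alpha^\top
   = \operatorname{Diag}(\hat\alpha), \\
\mathbb{E}[\psi\psi^\top]
  &= \operatorname{Diag}(\hat\alpha)^{-1/2}\,\operatorname{Diag}(\hat\alpha)\,
     \operatorname{Diag}(\hat\alpha)^{-1/2} = I.
\end{align*}
```

The cross term uses $\mathbb{E}[\pi] = \hat\alpha$, and the coefficient of $\hat\alpha\hat\alpha^\top$ vanishes because $s^2 = \alpha_0 + 1$.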

Lemma C.8 (Spectral bounds). With probability 1 — 6, 



£1 := 11/ - IXr^x^lW <olJ . log ^ ] (67) 






With probability 1 — 26, 

||E[(G^«^)^|n]|| = O (||Pi|a„.ax\/l^ll^l(l+^i+^2)) 



where 

Remark: When partitions $X, A$ satisfy (54), $\epsilon_1, \epsilon_2, \epsilon_3$ are small.

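Proof: Note that $\psi_i$ is a random vector with norm bounded by $O(\sqrt{(\alpha_0 + 1)/\hat\alpha_{\min}})$ from Lemma C.12 and $\mathbb{E}[\psi_i \psi_i^\top] = I$. We now prove (67) using the Matrix Bernstein Inequality. Each matrix $\psi_i \psi_i^\top / |X|$ has spectral norm at most $O((\alpha_0 + 1)/(\hat\alpha_{\min} |X|))$. The variance $\sigma^2$ is bounded by

$$\Big\| \frac{1}{|X|^2}\, \mathbb{E}\Big[ \sum_{i \in X} \|\psi_i\|^2 \psi_i \psi_i^\top \Big] \Big\| \le \frac{1}{|X|^2} \max \|\psi_i\|^2\, \Big\| \mathbb{E}\Big[ \sum_{i \in X} \psi_i \psi_i^\top \Big] \Big\| \le O\big( (\alpha_0 + 1)/(\hat\alpha_{\min} |X|) \big).$$

Since $O((\alpha_0 + 1)/(\hat\alpha_{\min} |X|)) < 1$, the variance dominates in Matrix Bernstein's inequality.
Let $B := |X|^{-1} \Psi_X \Psi_X^\top$. We have with probability $1 - \delta$,

$$\sigma_{\min}(\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi]) = \sqrt{|X|\, \sigma_{\min}(F_A\, \mathrm{Diag}(\hat\alpha)^{1/2} B\, \mathrm{Diag}(\hat\alpha)^{1/2} F_A^\top)} = \Omega\big( \sqrt{\hat\alpha_{\min} |X| (1 - \epsilon_1)} \cdot \sigma_{\min}(F_A) \big).$$

From Lemma C.12, with probability $1 - \delta$,

$$\sigma_{\min}(F_A) \ge \Big( \sqrt{\frac{|A| \hat\alpha_{\min}}{\alpha_0 + 1}} - O((|A| \log k/\delta)^{1/4}) \Big) \cdot \sigma_{\min}(P).$$

Similarly other results follow. □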

C.5 Properties of Dirichlet Distribution 

In this section, we list various properties of the Dirichlet distribution.
C.5.1 Sparsity Inducing Property

We first note that the Dirichlet distribution $\mathrm{Dir}(\alpha)$ is sparse depending on the values of $\alpha_i$, which is shown in [39].

Lemma C.9. Let reals $\tau \in (0,1]$, $\alpha_i > 0$, $\alpha_0 := \sum_i \alpha_i$ and integers $1 \le s \le k$ be given. Let $(X_1, \ldots, X_k) \sim \mathrm{Dir}(\alpha)$. Then

$$\Pr\big[ |\{ i : X_i \ge \tau \}| \le s \big] \ge 1 - \tau^{-\alpha_0} e^{-(s+1)/3} - e^{-4(s+1)/9},$$

when $s + 1 < 3k$.
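We now show that we obtain good initialization vectors under the Dirichlet distribution.
Arrange the $\hat\alpha_j$'s in ascending order, i.e. $\hat\alpha_1 = \hat\alpha_{\min} \le \hat\alpha_2 \le \ldots \le \hat\alpha_k = \hat\alpha_{\max}$. Recall that the column vectors $\hat W_A^\top G_{i,A}^\top$, for $i \notin A$, are used as initialization vectors to the tensor power method. We say that $u_i := \hat W_A^\top G_{i,A}^\top / \|\hat W_A^\top G_{i,A}^\top\|$ is a $(\gamma, R_0)$-good initialization vector corresponding to $j \in [k]$ if

$$|\langle u_i, \Phi_j \rangle| \ge R_0, \qquad |\langle u_i, \Phi_j \rangle| - \max_{m \ne j} |\langle u_i, \Phi_m \rangle| \ge \gamma |\langle u_i, \Phi_j \rangle|, \tag{69}$$

where $\Phi_j := \hat\alpha_j^{1/2} (\tilde F_A)_j$, and $(\tilde F_A)_j$ is the $j$-th column of $\tilde F_A := W_A^\top F_A$. Note that the $\Phi_j$ are orthonormal and are the eigenvectors to be estimated by the tensor power method.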

Lemma C.IO (Good initialization vectors under Dirichlet distribution). When 7r/~ Dir(a), and aj < 1, let 

A:=0 f-^) . (70) 



/nro 

For j G [k], there is at least one (7— ^^^^ , tq — A) -good vector corresponding to each ^j, for j € [k], among 
{ui}ie[n] with probability 1 — 95, when 

n = f2 (^Q,^i^en,ai/^,(ao+civ^)(2yt)'-oC2 log(fc/<5)) , (71) 

where ci := (1 + \/81og4) and 02 := 4/3 (log 4), when 

(1 - j)roa]^l{ao + (1 + ^8 log4)yfc^ + 4/3(log4)a-[f log 2k) > 1. (72) 
When ao < 1, the bound can be improved for ro G (0.5, (ao + 1)~^) md 1 — 7 > as 

n > ^ + log(fc/<5). (73) 

Remark when ao > 1) Q!o = ©(1): When ro is chosen as ro = ct^J^ {^/oq + ci\/k)~^ , the term 
e'-oSi4l(«o+civ^) = e, and we require 

n = ^ («;;Lfc° '' log(fcA)) , ro = a^J^i^o + ciVk)-\ (74) 

by substituting C2/C1 = 0.43. Moreover, (72) is satisfied for the above choice of ro when 7 = 0(1). 
In this case we also need A < ro/2, which implies 



C = (75) 

V Pkamax J 

Remark when ao < 1: In this regime, (73) implies that we require n = f^(S~Jjj). Also, ro is a constant, 
we just need ( = Oi-Jnj p). 

Proof: Define in := Wj FA7ri/||WjF^7ri||, when whitening matrix Wa and Fa corresponding to exact 

statistics are input. 

We first observe that if Ui is (7,ro) good, then Ui is (7 — ^."^^^ , ro — A) good. 

When Ut is (7,ro) good, note that WjFAiTi > ciml/xro because aminiWjFA) = oon^Ix and ||7ri|| > ro. 
Now with probability 1/4, we have 

max llwi — {till 
ie[fc] 

WjC^A - WAFA^^^\ I \W\FA-n,\ 
"mL V(ao + l)(max,(P3),) \ 



O 



ron^/'^a]^f^aminiP) 
from Lemma C.4. Notice that when communities are uniform, $\Delta$ is proportional to $\epsilon_{W_A}$.

If we perturb a $(\gamma, r_0)$ good vector by $\Delta$ (while maintaining unit norm), then it is still $(\gamma - \frac{2\Delta}{r_0 - \Delta},\ r_0 - \Delta)$ good.

We now show that the set $\{\tilde u_i\}$ contains good initialization vectors when $n$ is large enough. Consider $Y_i \sim \Gamma(\alpha_i, 1)$, where $\Gamma(\cdot,\cdot)$ denotes the Gamma distribution, so that $Y / \sum_i Y_i \sim \mathrm{Dir}(\alpha)$. We first compute the probability that $\tilde u_i := W_A^\top F_A \pi_i / \| W_A^\top F_A \pi_i \|$ is a $(r_0, \gamma)$-good vector with respect to $j = 1$ (recall that $\hat\alpha_1 = \hat\alpha_{\min}$). The desired event is


A, := (Sr^/Vi > TO IY^^JX') n (Sr^/Vi > ^ maxa-^/V,) 



(76) 



We have 
P [Ai] > P 

> P 

> P 

> P 



> r,t) n(E "7'^' < n ^ (1 - 7)^oiSi/'„) 
3 j>i 

^a7^i;?^<t^a7VV,<(l-7)roia:^,^„ 



for some t 



"min Yl > rot 



Smin^^l > rot 



maxF,- < (1 — j)rota. 
j>i 



1/2 

min 



maxK < (1 — ^)rota 



1/2 

min 



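When $\alpha_j \le 1$, we have

$$\mathbb{P}\Big[ \bigcup_j \{ Y_j \ge \log 2k \} \Big] \le 0.5,$$

since $\mathbb{P}(Y_j \ge t) \le t^{\alpha_j - 1} e^{-t} \le e^{-t}$ when $t > 1$ and $\alpha_j \le 1$. Applying vector Bernstein's inequality, we have with probability $0.5 - e^{-m}$ that

$$\| \mathrm{Diag}(\hat\alpha_j^{-1/2})(Y - \mathbb{E}(Y)) \|_2 \le (1 + \sqrt{8m}) \sqrt{k \alpha_0} + \frac{4}{3}\, m\, \hat\alpha_{\min}^{-1/2} \log 2k,$$

since $\mathbb{E}[\sum_j \hat\alpha_j^{-1} \mathrm{Var}(Y_j)] = k \alpha_0$, because $\hat\alpha_j = \alpha_j/\alpha_0$ and $\mathrm{Var}(Y_j) = \alpha_j$. Thus, we have

$$\| \mathrm{Diag}(\hat\alpha_j^{-1/2}) Y \|_2 \le \alpha_0 + (1 + \sqrt{8m}) \sqrt{k \alpha_0} + \frac{4}{3}\, m\, \hat\alpha_{\min}^{-1/2} \log 2k,$$

since $\| \mathrm{Diag}(\hat\alpha_j^{-1/2}) \mathbb{E}(Y) \|_2 = \sqrt{\sum_j \hat\alpha_j^{-1} \alpha_j^2} = \alpha_0$. Choosing $m = \log 4$, we have with probability $1/4$ that

$$\| \mathrm{Diag}(\hat\alpha_j^{-1/2}) Y \|_2 \le t := \alpha_0 + (1 + \sqrt{8 \log 4}) \sqrt{k \alpha_0} + \frac{4}{3} (\log 4)\, \hat\alpha_{\min}^{-1/2} \log 2k,$$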

1/2 1 

min 



We now have 



from Lemma C.13. 
Similarly, 



"mln^^i > rot 



= ao + ci Vfcao + C2a^i^ log 2k. 



(77) 
(78) 



> 



4C 



Omin-l ^-1/2 



maxFj- < al[^{l - 'y)rot 



> 1 _ ^ ((1 _ ^)rota]l^^)^' ' e-(i-^)'-««-n* > 1 - fce-^^"^)'-''""-*, 
assuming that $(1 - \gamma)\, r_0\, t\, \hat\alpha_{\min}^{1/2} > 1$.

Choosing $t$ as in (77), we have that the probability of the event in (76) is greater than



16C 



2(2fc)(i-')')''<"=2- 



(2A;)'" 



{'^oalli^iao + Cl^/ka^+ C2aJl^ log 2k) 



Similarly the (marginal) probability of events A2 can be bounded from below by replacing amin with a2 and 
so on. Thus, we have 



^] = f2 an 



{2kyoc2 



for all me [k]. 

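Thus, we have each of the events $A_1, A_2, \ldots, A_k$ occur at least once in $n$ i.i.d. tries with probability

$$1 - \mathbb{P}\Big[ \bigcup_{j \in [k]} \bigcap_{i \in [n]} A_j^c(i) \Big] \ge 1 - \sum_{j \in [k]} \mathbb{P}\Big[ \bigcap_{i \in [n]} A_j^c(i) \Big] \ge 1 - \sum_{j \in [k]} \exp[-n\, \mathbb{P}(A_j)],$$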



> 1 — k exp 



-ro Smax ("0 +ci Vkoo) 



where $A_j(i)$ denotes the event that $A_j$ occurs in the $i$-th trial, $A_j^c(i)$ is its complement, and we use that $1 - x \le e^{-x}$ when $x \in [0,1]$. Thus, for the event to occur with probability $1 - \delta$, we require



n = Q 



m'lx 



(ao+ciVsao) 



(2fc)''°^^ log(l/<5) . 
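The union-bound step above can be turned into a small calculator for the number of i.i.d. tries. The sketch below is generic: the function names are ours, and the probabilities passed in are placeholders rather than the paper's lower bound on $\mathbb{P}(A_j)$. It uses only $(1-p)^n \le e^{-np}$ and a union bound over the $k$ events.

```python
import math


def all_hit_lower_bound(n: int, probs: list[float]) -> float:
    """Lower bound 1 - sum_j exp(-n * p_j) on the probability that
    every event A_j occurs at least once in n i.i.d. tries."""
    return 1.0 - sum(math.exp(-n * p) for p in probs)


def tries_needed(probs: list[float], delta: float) -> int:
    """Smallest n of the form log(k/delta)/p_min (rounded up) for which
    the union bound guarantees success probability at least 1 - delta."""
    k = len(probs)
    return math.ceil(math.log(k / delta) / min(probs))


probs = [0.01, 0.02, 0.05]   # hypothetical per-try probabilities P(A_j)
n = tries_needed(probs, delta=0.05)
print(n, all_hit_lower_bound(n, probs))
```

This mirrors the scaling $n = \Omega(\log(k/\delta) / \min_j \mathbb{P}(A_j))$: the requirement on $n$ is driven by the least likely event, which in the proof corresponds to the smallest Dirichlet parameter.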



Improved Bound when $\alpha_0 < 1$: We can improve the above bound by directly working with the Dirichlet distribution. Let $\pi \sim \mathrm{Dir}(\alpha)$. The desired event corresponding to $j = 1$ is given by



Ai 



— 1/2 



I Diag(a. ^^^)7t 



Thus, we have 



W > 



(TTi > ro) fl [tt, < (1 - 7)ro) 

i>l 

> P[7ri > ro]P TTi < (1 - 7)ro|7ri > ro j 



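since $\mathbb{P}(\bigcap_{i>1} \pi_i \le (1 - \gamma) r_0 \mid \pi_1 \ge r_0) \ge \mathbb{P}(\bigcap_{i>1} \pi_i \le (1 - \gamma) r_0)$. By properties of the Dirichlet distribution, we know $\mathbb{E}[\pi_1] = \hat\alpha_1$ and $\mathbb{E}[\pi_1^2] = \hat\alpha_1 \frac{\alpha_1 + 1}{\alpha_0 + 1}$. Let $p := \Pr[\pi_1 \ge r_0]$. We have

$$\mathbb{E}[\pi_1^2] = p\, \mathbb{E}[\pi_1^2 \mid \pi_1 \ge r_0] + (1 - p)\, \mathbb{E}[\pi_1^2 \mid \pi_1 < r_0] \le p + (1 - p)\, r_0\, \mathbb{E}[\pi_1 \mid \pi_1 < r_0] \le p + (1 - p)\, r_0\, \mathbb{E}[\pi_1].$$

Rearranging, $p \ge \frac{\hat\alpha_1 (\alpha_1 + 1 - r_0(\alpha_0 + 1))}{(\alpha_0 + 1)(1 - r_0 \hat\alpha_1)}$, which is useful when $r_0(\alpha_0 + 1) < 1$. Also when $\pi_1 \ge r_0$, we have that $\pi_i \le 1 - r_0$ for $i > 1$, since $\pi_i \ge 0$ and $\sum_i \pi_i = 1$. Thus, choosing $1 - \gamma = \frac{1 - r_0}{r_0}$, the other conditions for $A_1$ are satisfied. Also, verify that we have $\gamma < 1$ when $r_0 > 0.5$, and this is feasible when $\alpha_0 < 1$. □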

We now prove a result that the entries of $\pi_i$, which are marginals of the Dirichlet distribution, are likely to be small in the sparse regime of the Dirichlet parameters. Recall that the marginal distribution of $\pi_i$ is $B(\alpha_i, \alpha_0 - \alpha_i)$, where $B(a,b)$ is the beta distribution and

$$\mathbb{P}[Z = z] \propto z^{a-1} (1 - z)^{b-1}, \quad Z \sim B(a,b).$$

Lemma C.11 (Marginal Dirichlet distribution in sparse regime). For $Z \sim B(a,b)$, the following results hold:

Case $b \le 1$, $C \in [0, 1/2]$:

$$\Pr[Z \ge C] \le 8 \log(1/C) \cdot \frac{a}{a+b}, \tag{79}$$
$$\mathbb{E}[Z \cdot \delta(Z \le C)] \le C \cdot \mathbb{E}[Z] = C \cdot \frac{a}{a+b}. \tag{80}$$

Case $b \ge 1$, $C \le (b+1)^{-1}$: we have

$$\Pr[Z \ge C] \le a \log(1/C), \tag{81}$$
$$\mathbb{E}[Z \cdot \delta(Z \le C)] \le 6aC. \tag{82}$$

Remark: The guarantee for $b \ge 1$ is worse and this agrees with the intuition that the Dirichlet vectors are more spread out (or less sparse) when $b = \alpha_0 - \alpha_i$ is large.


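Proof. We have

$$\mathbb{E}[Z \cdot \delta(Z \le C)] = \int_0^C \frac{1}{B(a,b)}\, x \cdot x^{a-1} (1 - x)^{b-1}\, dx \le \frac{(1 - C)^{b-1}}{B(a,b)} \int_0^C x^a\, dx = \frac{(1 - C)^{b-1} C^{a+1}}{(a + 1) B(a,b)}.$$

For $\mathbb{E}[Z \cdot \delta(Z \ge C)]$, we have

$$\mathbb{E}[Z \cdot \delta(Z \ge C)] = \int_C^1 \frac{1}{B(a,b)}\, x^a (1 - x)^{b-1}\, dx \ge \frac{C^a}{B(a,b)} \int_C^1 (1 - x)^{b-1}\, dx = \frac{(1 - C)^b C^a}{b\, B(a,b)}.$$

The ratio between these two is at least

$$\frac{\mathbb{E}[Z \cdot \delta(Z \ge C)]}{\mathbb{E}[Z \cdot \delta(Z \le C)]} \ge \frac{(1 - C)(a + 1)}{bC} \ge \frac{1}{C}.$$

The last inequality holds when $a, b < 1$ and $C < 1/2$. The sum of the two is exactly $\mathbb{E}[Z]$, so when $C < 1/2$ we know $\mathbb{E}[Z \cdot \delta(Z \le C)] \le C \cdot \mathbb{E}[Z]$.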
Next we bound the probability $\Pr[Z \ge C]$. Note that $\Pr[Z \ge 1/2] \le 2\mathbb{E}[Z] = \frac{2a}{a+b}$ by Markov's inequality. Now we show that $\Pr[Z \in [C, 1/2]]$ is not much larger than $\Pr[Z \ge 1/2]$ by bounding the integrals

$$A = \int_{1/2}^1 x^{a-1} (1 - x)^{b-1}\, dx \ge \int_{1/2}^1 (1 - x)^{b-1}\, dx = (1/2)^b / b,$$

$$B = \int_C^{1/2} x^{a-1} (1 - x)^{b-1}\, dx \le (1/2)^{b-1} \int_C^{1/2} x^{a-1}\, dx \le (1/2)^{b-1}\, \frac{0.5^a - C^a}{a} \le (1/2)^{b-1}\, \frac{1 - (1 - a \log 1/C)}{a} = (1/2)^{b-1} \log(1/C).$$

The last inequality uses the fact that $e^x \ge 1 + x$ for all $x$. Now

$$\Pr[Z \ge C] = \Big( 1 + \frac{B}{A} \Big) \Pr[Z \ge 1/2] \le (1 + 2b \log(1/C))\, \frac{2a}{a+b} \le 8 \log(1/C) \cdot \frac{a}{a+b},$$

and we have the result.



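Case 2: When $b \ge 1$, we have an alternative bound. We use the fact that if $X \sim \Gamma(a, 1)$ and $Y \sim \Gamma(b, 1)$ then $Z \sim X/(X + Y)$. Since $Y$ is distributed as $\Gamma(b, 1)$, its PDF is $\frac{1}{\Gamma(b)} x^{b-1} e^{-x}$. This is proportional to the PDF of $\Gamma(1)$ (that is, $e^{-x}$) multiplied by an increasing function $x^{b-1}$. Therefore we know $\Pr[Y \ge t] \ge \Pr_{Y' \sim \Gamma(1)}[Y' \ge t] = e^{-t}$.

Now we use this bound to compute the probability that $Z \le 1/R$ for all $R \ge 1$. This is equivalent to

$$\Pr\Big[ \frac{X}{X + Y} \le \frac{1}{R} \Big] = \int_0^\infty \Pr[Y \ge (R - 1)x]\, \frac{1}{\Gamma(a)} x^{a-1} e^{-x}\, dx \ge \int_0^\infty \frac{1}{\Gamma(a)} x^{a-1} e^{-Rx}\, dx = R^{-a}.$$

In particular, $\Pr[Z \le C] \ge C^a$, which means $\Pr[Z \ge C] \le 1 - C^a \le a \log(1/C)$.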
For E[Z6{Z < C)], the proof is similar as before: 

(ja+l 



P = E[Zd(Z <C)]= [ ,, x''(l-x)''dx< 
Q = E[ZS{Z > C)] = £ - x)'dx - ^"^^ " 



b+l 



B(a,6)(6+1) 

Now $\mathbb{E}[Z \delta(Z < C)] \le \frac{P}{Q}\, \mathbb{E}[Z] \le 6aC$ when $C < 1/(b+1)$. □
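The sparse-regime bounds of Lemma C.11 (case $b \le 1$) can be checked numerically. The sketch below uses simple midpoint quadrature and only the standard library; the helper names are ours, and the right-hand sides are (79) and (80) as reconstructed from the proof, i.e. $8\log(1/C)\, a/(a+b)$ and $C\, a/(a+b)$. The substitution $u = x^a$ is used to remove the integrable singularity of the Beta density at $0$.

```python
import math


def beta_fn(a: float, b: float) -> float:
    """Beta function B(a, b) via the Gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def midpoint(f, lo: float, hi: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def truncated_mean(a: float, b: float, c: float) -> float:
    """E[Z * 1(Z <= c)] for Z ~ Beta(a, b), with c <= 1/2."""
    return midpoint(lambda x: x ** a * (1 - x) ** (b - 1), 0.0, c) / beta_fn(a, b)


def tail_prob(a: float, b: float, c: float) -> float:
    """Pr[Z >= c] for Z ~ Beta(a, b); substitutes u = x^a in the CDF integral."""
    cdf = midpoint(lambda u: (1 - u ** (1 / a)) ** (b - 1), 0.0, c ** a)
    return 1.0 - cdf / (a * beta_fn(a, b))


for a, b, c in [(0.3, 0.8, 0.25), (0.1, 1.0, 0.5), (1.0, 1.0, 0.5)]:
    mean = a / (a + b)
    print(a, b, c,
          truncated_mean(a, b, c) <= c * mean,
          tail_prob(a, b, c) <= 8 * math.log(1 / c) * mean)
```

For $b = 1$ both quantities have closed forms ($\Pr[Z \ge C] = 1 - C^a$ and $\mathbb{E}[Z\delta(Z \le C)] = aC^{a+1}/(a+1)$), which gives an independent check of the quadrature.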
C.5.2 Norm Bounds

Lemma C.12 (Norm Bounds under Dirichlet distribution). For 7r/~ Dir(a) for i € A, with probability 1 — S, 
we have 



an^inillA) > J - Oii\A\ log k/S) 
V Qfn + 1 



1/4N 



lin^ll < V\A\Sn... + o{{\A\iogk/sy/^), 

k{Ua) < Ji^^^D^ + o{{\A\ logk/Sy/% 

This implies that \\Fa\\ < \\P\\ \^\A\a^^, k{Fa) < 0{K{P)^J (ao + l)Smax/Smm)- Moreover, with probabil- 
ity 1 — 5 

\\FAh < |A|-max(Pa), + o||||P||y|A|log^j (83) 

Remark: When \A\ = O ^log f (f^) , we have amin(nA) = ^{\f^^^) with probabihty 1 - 5 for 
any fixed S G (0, 1). 

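Proof: Consider $\Pi_A \Pi_A^\top = \sum_{i \in A} \pi_i \pi_i^\top$. We have

$$\frac{1}{|A|}\, \mathbb{E}[\Pi_A \Pi_A^\top] = \mathbb{E}_{\pi \sim \mathrm{Dir}(\alpha)}[\pi \pi^\top] = \frac{\alpha_0}{\alpha_0 + 1}\, \hat\alpha \hat\alpha^\top + \frac{1}{\alpha_0 + 1}\, \mathrm{Diag}(\hat\alpha),$$

from Proposition C.14. The first term is positive semi-definite so the eigenvalues of the sum are at least the eigenvalues of the second component. The smallest eigenvalue of the second component gives a lower bound on $\sigma_{\min}(\mathbb{E}[\Pi_A \Pi_A^\top])$. The spectral norm of the first component is bounded by $\frac{\alpha_0}{\alpha_0 + 1} \|\hat\alpha\|^2 \le \frac{\alpha_0}{\alpha_0 + 1} \hat\alpha_{\max}$; the spectral norm of the second component is $\frac{1}{\alpha_0 + 1} \hat\alpha_{\max}$. Thus $\| \mathbb{E}[\Pi_A \Pi_A^\top] \| \le |A| \cdot \hat\alpha_{\max}$.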

Now applying Matrix Bernstein's inequality to (j^ii^J — E[7r7r^]). We have that the variance is 

0{1/\A\). Thus with probability 1 - 5, 



1^1 



1^1 j' 



For the result on $F$, we use the property that for any two matrices $A, B$, $\|AB\| \le \|A\| \|B\|$ and $\kappa(AB) \le \kappa(A) \kappa(B)$.

To show bound on ||-Fa||i, note that each column of Fa satisfies E[(P^)i] = (a, (P)i)l^, and thus 
1|E[Fa]||i < \A\ maxi(Pa)i. Using Bernstein's inequality, for each column of Fa, we have, with probability 

1-5, 

I \\{FA)ih - \A\ {a, {P)i)\ = O (^||P||^|A|log^j , 

by applying Bernstein's inequality, since | {a,{P)i) \ < \\P\\, and thus we have J2ieA \\^[(P)J '^i^^ 

and EieA ||E[7r7(P),-(P)77r]|| < \A\ ■ \\P\\. □ 
C.5.3 Properties of Gamma and Dirichlet Distributions

Recall that the Gamma distribution $\Gamma(\alpha, \beta)$ is a distribution on nonnegative real values with density function $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$.

Proposition C.13 (Dirichlet and Gamma distributions). The following facts are known for the Dirichlet distribution and the Gamma distribution.

1. Let $Y_i \sim \Gamma(\alpha_i, 1)$ be independent random variables; then the vector $(Y_1, Y_2, \ldots, Y_k) / \sum_{i=1}^k Y_i$ is distributed as $\mathrm{Dir}(\alpha)$.

2. The $\Gamma$ function satisfies Euler's reflection formula: $\Gamma(1 - z) \Gamma(z) = \pi / \sin(\pi z)$.

3. $\Gamma(z) \ge 1$ when $0 < z < 1$.

4. There exists a universal constant $C$ such that $\Gamma(z) \le C/z$ when $0 < z < 1$.

5. For $Y \sim \Gamma(\alpha, 1)$ and $t > 0$ and $\alpha \in (0,1)$, we have

$$\frac{\alpha}{4C}\, t^{\alpha-1} e^{-t} \le \Pr[Y \ge t] \le t^{\alpha-1} e^{-t}, \tag{84}$$

and for any $\eta, c > 1$, we have

$$\mathbb{P}[Y \ge \eta t \mid Y \ge t] \ge (c\eta)^{\alpha-1} e^{-(\eta-1)t}. \tag{85}$$
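Proof: The bounds in (84) are derived using the fact that $1 \le \Gamma(\alpha) \le C/\alpha$ when $\alpha \in (0,1)$ and

$$\int_t^\infty \frac{1}{\Gamma(\alpha_i)} x^{\alpha_i - 1} e^{-x}\, dx \le \frac{1}{\Gamma(\alpha_i)} \int_t^\infty t^{\alpha_i - 1} e^{-x}\, dx \le t^{\alpha_i - 1} e^{-t},$$

and

$$\int_t^\infty \frac{1}{\Gamma(\alpha_i)} x^{\alpha_i - 1} e^{-x}\, dx \ge \frac{1}{\Gamma(\alpha_i)} \int_t^{2t} x^{\alpha_i - 1} e^{-x}\, dx \ge \frac{\alpha_i}{C} \int_t^{2t} (2t)^{\alpha_i - 1} e^{-x}\, dx \ge \frac{\alpha_i}{4C}\, t^{\alpha_i - 1} e^{-t}.$$

□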
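The upper bound in (84) is easy to confirm numerically. The following sketch evaluates the Gamma tail by midpoint quadrature (standard library only; the function names and the truncation width are our choices) and compares it with $t^{\alpha-1} e^{-t}$. The lower bound in (84) is not tested because it depends on the unspecified universal constant $C$.

```python
import math


def gamma_tail(a: float, t: float, width: float = 60.0, steps: int = 200_000) -> float:
    """Pr[Y >= t] for Y ~ Gamma(a, 1), integrating the density over [t, t + width].

    The integrand is smooth for t > 0, and the neglected mass beyond
    t + width is below exp(-width).
    """
    h = width / steps
    total = 0.0
    for i in range(steps):
        x = t + (i + 0.5) * h
        total += x ** (a - 1.0) * math.exp(-x)
    return h * total / math.gamma(a)


def gamma_tail_upper(a: float, t: float) -> float:
    """Right-hand side of (84): t^(a-1) * exp(-t)."""
    return t ** (a - 1.0) * math.exp(-t)


for a, t in [(0.3, 1.0), (0.7, 2.0), (0.5, 0.5)]:
    print(a, t, gamma_tail(a, t), gamma_tail_upper(a, t))
```

For $\alpha = 1/2$ the tail has the closed form $\mathrm{erfc}(\sqrt{t})$, which can be used to validate the quadrature itself.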



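Proposition C.14 (Moments under Dirichlet distribution). Suppose $v \sim \mathrm{Dir}(\alpha)$; the moments of $v$ satisfy the following formulas:

$$\mathbb{E}[v_i] = \frac{\alpha_i}{\alpha_0},$$
$$\mathbb{E}[v_i^2] = \frac{\alpha_i (\alpha_i + 1)}{\alpha_0 (\alpha_0 + 1)},$$
$$\mathbb{E}[v_i v_j] = \frac{\alpha_i \alpha_j}{\alpha_0 (\alpha_0 + 1)}, \quad i \ne j.$$

More generally, if $a^{(t)} = \prod_{i=0}^{t-1} (a + i)$, then we have

$$\mathbb{E}\Big[ \prod_{i=1}^{k} v_i^{a_i} \Big] = \frac{\prod_{i=1}^{k} \alpha_i^{(a_i)}}{\alpha_0^{(\sum_{i=1}^{k} a_i)}}.$$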
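Proposition C.13 (item 1) and Proposition C.14 can be checked together by Monte Carlo: sample Dirichlet vectors by normalizing independent Gamma draws, then compare empirical first and second moments with the closed forms. This is a sketch using only the standard library; the function names, the parameter vector, and the sample size are illustrative choices of ours.

```python
import random


def sample_dirichlet(alpha: list[float], rng: random.Random) -> list[float]:
    """Draw one Dir(alpha) vector by normalizing independent Gamma(alpha_i, 1) variables."""
    y = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(y)
    return [v / total for v in y]


def empirical_moments(alpha: list[float], n: int, seed: int = 0):
    """Return (mean vector, second-moment matrix E[v v^T]) estimated from n samples."""
    rng = random.Random(seed)
    k = len(alpha)
    mean = [0.0] * k
    second = [[0.0] * k for _ in range(k)]
    for _ in range(n):
        v = sample_dirichlet(alpha, rng)
        for i in range(k):
            mean[i] += v[i] / n
            for j in range(k):
                second[i][j] += v[i] * v[j] / n
    return mean, second


def exact_moments(alpha: list[float]):
    """Closed-form moments from Proposition C.14."""
    a0 = sum(alpha)
    k = len(alpha)
    mean = [a / a0 for a in alpha]
    second = [[alpha[i] * (alpha[j] + (1.0 if i == j else 0.0)) / (a0 * (a0 + 1.0))
               for j in range(k)] for i in range(k)]
    return mean, second


alpha = [0.5, 1.0, 1.5]
print(empirical_moments(alpha, 20_000))
print(exact_moments(alpha))
```

The second-moment matrix returned by `exact_moments` equals $\frac{\alpha_0}{\alpha_0+1}\hat\alpha\hat\alpha^\top + \frac{1}{\alpha_0+1}\mathrm{Diag}(\hat\alpha)$, the form used in the proof of Lemma C.12.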
C.6 Standard Results



Bernstein's Inequalities: One of the key tools we use is the standard matrix Bernstein inequality [40, 
thm. 1.4]. 

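Proposition C.15 (Matrix Bernstein Inequality). Suppose $Z = \sum_j W_j$ where

1. $W_j$ are independent random matrices with dimension $d_1 \times d_2$,

2. $\mathbb{E}[W_j] = 0$ for all $j$,

3. $\|W_j\| \le R$ almost surely.

Let $d = d_1 + d_2$, and $\sigma^2 = \max\big\{ \| \sum_j \mathbb{E}[W_j W_j^\top] \|,\ \| \sum_j \mathbb{E}[W_j^\top W_j] \| \big\}$; then we have

$$\Pr[\|Z\| \ge t] \le d \cdot \exp\Big\{ \frac{-t^2/2}{\sigma^2 + Rt/3} \Big\}.$$

Proposition C.16 (Vector Bernstein Inequality). Let $z = (z_1, z_2, \ldots, z_n) \in \mathbb{R}^n$ be a random vector with independent entries, $\mathbb{E}[z_i] = 0$, $\mathbb{E}[z_i^2] = \sigma_i^2$, and $\Pr[|z_i| \le 1] = 1$. Let $A = [a_1 | a_2 | \cdots | a_n] \in \mathbb{R}^{m \times n}$ be a matrix; then

$$\Pr\Big[ \|Az\| \le (1 + \sqrt{8t}) \sqrt{\sum_{i=1}^{n} \|a_i\|^2 \sigma_i^2} + (4/3) \max_{i \in [n]} \|a_i\|\, t \Big] \ge 1 - e^{-t}.$$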
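In the proofs above, Matrix Bernstein is always used in its "with probability $1 - \delta$" form. A minimal sketch of that conversion follows; the function names are ours, and the threshold $t = \sqrt{2\sigma^2 L} + \tfrac{2}{3} R L$ with $L = \log(d/\delta)$ is one standard sufficient choice, not a formula quoted from the paper.

```python
import math


def bernstein_tail(d: int, sigma2: float, r: float, t: float) -> float:
    """Right-hand side of Proposition C.15: d * exp(-(t^2/2) / (sigma^2 + R*t/3))."""
    return d * math.exp(-(t * t / 2.0) / (sigma2 + r * t / 3.0))


def bernstein_threshold(d: int, sigma2: float, r: float, delta: float) -> float:
    """A deviation t such that bernstein_tail(d, sigma2, r, t) <= delta.

    With L = log(d/delta), t = sqrt(2*sigma2*L) + 2*R*L/3 satisfies
    t^2/2 >= L * (sigma2 + R*t/3).
    """
    log_term = math.log(d / delta)
    return math.sqrt(2.0 * sigma2 * log_term) + 2.0 * r * log_term / 3.0


# Hypothetical numbers: when sigma^2 is large relative to R^2 * log(d/delta),
# the sqrt (variance) term dominates the threshold.
t = bernstein_threshold(d=200, sigma2=1e4, r=5.0, delta=0.01)
print(t, bernstein_tail(200, 1e4, 5.0, t))
```

This is the sense in which the proofs say "the variance term dominates": the deviation is then of order $\sigma \sqrt{\log(d/\delta)}$.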



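Wedin's theorem: We make use of Wedin's theorem to control subspace perturbations.

Lemma C.17 (Wedin's theorem; Theorem 4.4, p. 262 in [38]). Let $A, E \in \mathbb{R}^{m \times n}$ with $m \ge n$ be given. Let $A$ have the singular value decomposition

$$\begin{bmatrix} U_1^\top \\ U_2^\top \\ U_3^\top \end{bmatrix} A \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{bmatrix}.$$

Let $\tilde A := A + E$, with analogous singular value decomposition $(\tilde U_1, \tilde U_2, \tilde U_3, \tilde\Sigma_1, \tilde\Sigma_2, \tilde V_1, \tilde V_2)$. Let $\Phi$ be the matrix of canonical angles between $\mathrm{range}(U_1)$ and $\mathrm{range}(\tilde U_1)$, and $\Theta$ be the matrix of canonical angles between $\mathrm{range}(V_1)$ and $\mathrm{range}(\tilde V_1)$. If there exist $\delta, \alpha > 0$ such that $\min_i \sigma_i(\tilde\Sigma_1) \ge \alpha + \delta$ and $\max_i \sigma_i(\Sigma_2) \le \alpha$, then

$$\max\{ \|\sin \Phi\|_2, \|\sin \Theta\|_2 \} \le \frac{\|E\|_2}{\delta}.$$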